
Datasets, tasks, and training methods for large-scale hypergraph learning


Abstract

Relations among multiple entities are prevalent in many fields, and hypergraphs are widely used to represent such group relations. Hence, machine learning on hypergraphs has received considerable attention, and in particular much effort has been devoted to neural network architectures for hypergraphs (a.k.a. hypergraph neural networks). However, existing studies have mostly focused on small datasets for a few single-entity-level downstream tasks and have overlooked scalability issues, although most real-world group relations are large-scale. In this work, we propose new tasks, datasets, and scalable training methods to address these limitations. First, we introduce two pair-level hypergraph-learning tasks that formulate a wide range of real-world problems. Then, we build and publicly release two large-scale hypergraph datasets with tens of millions of nodes, rich features, and labels. After that, we propose PCL, a scalable learning method for hypergraph neural networks. To tackle scalability issues, PCL splits a given hypergraph into partitions and trains a neural network via contrastive learning. Our extensive experiments demonstrate that hypergraph neural networks can be trained by PCL on large-scale hypergraphs while outperforming 16 baseline models. Specifically, their performance is comparable to, and in some cases surprisingly even better than, that achieved by training hypergraph neural networks on the entire hypergraphs without partitioning.


Code and Data Availability

The source code used in this paper and the large-scale hypergraph datasets that we build are publicly available at https://github.com/kswoo97/pcl for reproducibility.

Notes

  1. For example, in an unsupervised setting without any given positive example pairs, one can additionally split edges in \(\mathcal {E}'\) to use them as positive example pairs for training.

  2. For example, it can be considered in unsupervised settings especially when the clustering membership is strongly correlated with node features and/or topological information.

  3. Available at https://www.aminer.cn/oag-2-1.

  4. Available at https://github.com/UKPLab/sentence-transformers.

  5. Available at https://scholar.google.com/citations?view_op=top_venues&hl=en&vq=eng. In this taxonomy, a single venue is associated with a single sub-field.

  6. PaToH is a balanced partitioning method. It ensures that all generated partitions are of similar sizes (Çatalyürek and Aykanat 2011), specifically satisfying \(\vert \mathcal {P}_{k}^{\mathcal {V}}\vert \le \frac{(1 + \epsilon )}{\vert \mathcal {P}\vert } \sum _{i=1}^{\vert \mathcal {P}\vert } \vert \mathcal {P}_{i}^{\mathcal {V}}\vert , \ \forall k=1,\cdots ,\vert \mathcal {P}\vert\). As shown in Table 8 in Sect. 6.3.5, partitions obtained by PaToH from real-world hypergraphs are well balanced. Specifically, the standard deviation of the number of nodes in each partition is less than 0.5% of the average number of nodes per partition.

  7. One can set K based on the amount of available memory (a smaller K generally incurs higher memory consumption). Note that the performance of the proposed method is not significantly affected by K, as demonstrated in Sect. 6.3.5.

  8. Note that other self-supervised losses (e.g., Addanki et al. 2021) can be used alternatively.

  9. Since graph encoders require a graph topology as an input, we convert original hypergraphs into graphs by Clique Expansion, described in Appendix A.2.

  10. This is because partitioning algorithms generally assign such nodes to the same partition.

  11. At each mini-batch (partition) of contrastive learning, we record the GPU memory usage after completing the gradient computation (i.e., we execute loss.backward() and check the current GPU memory allocation using torch.cuda.memory_allocated(device)). After training the encoder on every mini-batch, we calculate the average GPU memory usage for the current epoch by averaging the usage across all partitions. Finally, we compute the average GPU memory usage across all epochs (see the sketch after these notes).

  12. The total contrastive training epochs are 50.
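The measurement protocol of note 11 can be sketched as follows. This is a minimal illustration rather than the released implementation: the encoder, the partition iterator, and the loss_fn argument are placeholders, and only loss.backward() and torch.cuda.memory_allocated(device) are taken from the note.

```python
import torch

def train_one_epoch_with_memory_logging(encoder, partitions, loss_fn, optimizer, device):
    """Train on each partition (mini-batch) and record GPU memory right after backward()."""
    per_partition_memory = []
    for batch in partitions:                  # one mini-batch = one partition
        optimizer.zero_grad()
        loss = loss_fn(encoder, batch)        # placeholder for the contrastive loss
        loss.backward()                       # gradients are now materialized
        per_partition_memory.append(torch.cuda.memory_allocated(device))
        optimizer.step()
    # Per-epoch figure: average over all partitions; the reported number is the further
    # average of these per-epoch values over all training epochs.
    return sum(per_partition_memory) / len(per_partition_memory)
```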


Funding

This work was supported by Samsung Electronics Co., Ltd., National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2020R1C1C1008296), and Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00157, Robust, Fair, Extensible Data-Centric Continual Learning) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)).

Author information


Corresponding author

Correspondence to Kijung Shin.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Responsible editor: Charalampos Tsourakakis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Additional experimental settings

1.1 Details of datasets

DBLP is a co-authorship hypergraph where nodes and hyperedges correspond to publications and authors, respectively. Each publication’s class is labeled according to its field of study. Trivago is a hotel-web-search hypergraph where each node represents a hotel and each hyperedge corresponds to a user. If a user (hyperedge) has visited the website of a particular hotel (node), the corresponding node is added to that user’s hyperedge. Each hotel’s class is labeled based on the country in which it is located. OGBN-MAG is originally a heterogeneous graph that contains comprehensive academic information, including venue, author, publication, and affiliation information. We transform this heterogeneous graph into a hypergraph as described in Sect. 4, and the label of each node (publication) indicates the venue where the corresponding publication appeared.
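To make the Trivago construction concrete, the following is a hypothetical sketch of how such a hypergraph can be assembled from (user, hotel) interaction logs; the variable names and input format are illustrative assumptions, not the released preprocessing code.

```python
from collections import defaultdict

def build_hyperedges(interactions):
    """interactions: iterable of (user_id, hotel_id) pairs; returns user -> set of hotels."""
    hyperedges = defaultdict(set)
    for user_id, hotel_id in interactions:
        hyperedges[user_id].add(hotel_id)   # add the hotel (node) to the user's hyperedge
    return dict(hyperedges)

# Example: two users (hyperedges) visiting three hotels (nodes).
print(build_hyperedges([("u1", "h1"), ("u1", "h2"), ("u2", "h2"), ("u2", "h3")]))
```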

1.2 Details of graph-based baseline methods

Since graph representation models (Kipf and Welling 2017; Veličković et al. 2018) require an ordinary graph structure as input, we transform the original hypergraph datasets into ordinary graph datasets using clique expansion, where each hyperedge is replaced with a clique in the resulting graph. Formally, clique expansion transforms a given hyperedge set \(\mathcal {E}\) into the clique-expanded edge set \(\mathcal {E}_{G} = \bigcup _{e \in \mathcal {E}} \binom{e}{2}\).
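As a concrete illustration, here is a minimal sketch of clique expansion (not the authors' released code): every hyperedge contributes all node pairs it contains to \(\mathcal {E}_{G}\).

```python
from itertools import combinations

def clique_expansion(hyperedges):
    """hyperedges: iterable of node collections; returns the undirected edge set E_G."""
    edges = set()
    for e in hyperedges:
        # All 2-element subsets of the hyperedge, i.e., the clique over its nodes.
        edges.update(combinations(sorted(set(e)), 2))
    return edges

# Example: the hyperedge {1, 2, 3} contributes the clique {(1, 2), (1, 3), (2, 3)}.
print(clique_expansion([[1, 2, 3], [3, 4]]))
```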

[Algorithm 2: clique expansion with sampling for large hyperedges]

Specifically, for the full-graph datasets of DBLP and Trivago, we directly obtain \(\mathcal {E}_{G}\) from \(\mathcal {E}\), the entire hyperedge set. For the full-graph dataset of OGBN-MAG, the resulting clique-expanded edge set is too large to be loaded into main memory. To reduce its scale, we additionally employ sampling, as described in Algorithm 2. Specifically, for each hyperedge \(e'\) whose size is greater than k and for each constituent node \(v \in e'\), we uniformly sample k other nodes from \(e'\) (line 7) and create k edges joining v and each of the k sampled nodes. Here, we set \(k=10\) for the OGBN-MAG dataset. We fail to create full-graph datasets for AMiner and MAG since clique expansion runs out of memory even with small k (around 3), and thus we cannot perform full-graph experiments on them.
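Our reading of the sampling strategy above can be sketched as follows. This is an illustrative approximation rather than the released Algorithm 2; in particular, fully expanding hyperedges of size at most k is our assumption, as the text only specifies the behavior for larger hyperedges.

```python
import random
from itertools import combinations

def sampled_clique_expansion(hyperedges, k=10):
    """Clique expansion that caps the per-node edge contribution of large hyperedges."""
    edges = set()
    for e in hyperedges:
        nodes = list(set(e))
        if len(nodes) <= k:
            edges.update(combinations(sorted(nodes), 2))   # small hyperedge: full clique
        else:
            for v in nodes:
                others = [u for u in nodes if u != v]
                for u in random.sample(others, k):          # k uniformly sampled co-members
                    edges.add((min(u, v), max(u, v)))
    return edges
```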

For partitioned-graph datasets of DBLP, Trivago, and OGBN-MAG, we apply clique expansion to the hyperedge set in each partition and use the resulting clique-expanded edge set as that of the corresponding partition. For partitioned-graph datasets of AMiner and MAG, due to the scalability issue, we apply the sampling strategy described in Algorithm 2 to each partition \(\mathcal {P}_{i}\) of \(\varvec{\mathcal {P}}\) (i.e., the input is \(\mathcal {P}^{\mathcal {E}}_{i}\) instead of \(\mathcal {E}\)) and treat the resulting edge set as the edge set of the corresponding partition. Here, we set k to 10.

1.3 Details of hyperparameter settings

We now provide detailed hyperparameter settings of representation models and training methods. The number of layers and hidden dimension of all representation models are fixed to 2 and 128, respectively.

For representation models that are trained via supervised learning methods, we train each model for 100 epochs. We tune the learning rate of each model within \(\{0.01, 0.001, 0.0001\}\). Every 10 epochs, we measure the validation AP score and save the model parameters. Then, we designate the checkpoint with the highest validation AP score as the final model parameters.

For representation models that are trained via all versions of PCL, we tune the number of self-supervised learning epochs within \(\{25, 50\}\), while we use a broader search space, specifically \(\{20, 40, 60, 80, 100\}\), for the other self-supervised learning methods. We tune the learning rate of self-supervised learning within \(\{0.001, 0.0001\}\) for all self-supervised learning methods. In addition, for methods that require augmentation steps, we tune the extent of node-feature augmentation \(p_{v}\) within \(\{0.3, 0.4\}\) and the extent of topological augmentation \(p_{e}\) within \(\{0.3, 0.4\}\). Furthermore, for methods that require negative samples for contrastive learning, we tune the number of negative samples N within \(\{1, 2\}\). The temperature parameter \(\tau\) of all self-supervised learning methods and the scalar \(\lambda\) that controls the strength of the inter-partition loss in PCL+PINS are both fixed to 0.5. Lastly, we train the downstream-task classifiers of all self-supervised learning methods with a learning rate of 0.001. We train the classifiers for 100 epochs; every 10 epochs, we measure the validation AP score and save the classifier parameters. Then, we designate the checkpoint with the highest validation AP score as the final classifier parameters (the checkpoint-selection procedure is sketched below).
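The validation-AP-based checkpoint selection described above (used for both the supervised models and the downstream classifiers) can be sketched as follows; the train_one_epoch and predict_scores callables and the use of scikit-learn's average_precision_score are illustrative assumptions, not the released code.

```python
import copy
from sklearn.metrics import average_precision_score

def train_with_ap_checkpointing(model, optimizer, train_one_epoch, predict_scores,
                                val_labels, num_epochs=100, eval_every=10):
    """Train for num_epochs; every eval_every epochs keep the best-validation-AP checkpoint."""
    best_ap, best_state = -1.0, None
    for epoch in range(1, num_epochs + 1):
        train_one_epoch(model, optimizer)                  # one training epoch (placeholder)
        if epoch % eval_every == 0:
            val_ap = average_precision_score(val_labels, predict_scores(model))
            if val_ap > best_ap:
                best_ap, best_state = val_ap, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)                      # restore the best checkpoint
    return model, best_ap
```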

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Kim, S., Lee, D., Kim, Y. et al. Datasets, tasks, and training methods for large-scale hypergraph learning. Data Min Knowl Disc 37, 2216–2254 (2023). https://doi.org/10.1007/s10618-023-00952-6

