
User Cold-Start Recommendation via Inductive Heterogeneous Graph Neural Network

Published: 07 February 2023

Abstract

Recently, user cold-start recommendation has attracted a lot of attention from industry and academia. In user cold-start recommendation systems, existing approaches often rely on user attribute information to learn user preferences, because user action data are unavailable. However, most existing recommendation methods ignore the sparsity of user attributes in cold-start recommendation systems. To tackle this limitation, this article proposes a novel Inductive Heterogeneous Graph Neural Network (IHGNN) model, which utilizes the relational information in user cold-start recommendation systems to alleviate the sparsity of user attributes. Our model converts new users, items, and associated multimodal information into a Modality-aware Heterogeneous Graph (M-HG) that preserves the rich and heterogeneous relationship information among them. Specifically, to utilize the rich and heterogeneous relational information in an M-HG to enrich the sparse attribute information of new users, we design a strategy based on random walk operations that collects associated neighbors of new users through multiple sampling operations. Then, a well-designed multiple hierarchical attention aggregation model consisting of intra- and inter-type attention aggregating modules is proposed, which focuses on useful connected neighbors and neglects meaningless and noisy ones to generate high-quality representations for user cold-start recommendation. Experimental results on three real datasets demonstrate that the IHGNN outperforms the state-of-the-art baselines.

1 Introduction

Personalized recommendation is an important and challenging task that has received substantial attention from the academic community. For example, traditional collaborative filtering-based recommendation algorithms, such as matrix factorization (MF) techniques [17], have been developed to learn expressive representations (lookup tables) for users and items; they use past user-item ratings to predict future ratings. Although much research has been devoted to reliable and efficient algorithms for personalized recommendation, existing works are still hampered by the cold-start problem, that is, scenarios that must deal with new users or items whose vector embeddings have not been learned due to a lack of preference information.
In cold-start recommendation, there are two information spaces: an attribute information space and a behavior information space. The attribute space describes the user's or item's characteristics (e.g., a user's personal information, an item's content information), and the behavior space represents user interactions (e.g., purchase behavior and past interactions). Most existing cold-start recommendation methods assume that there are no behavior interactions but that there is abundant attribute information for new users or new items. Existing methods for the cold-start recommendation task can be roughly categorized into three research lines. (1) Content-based recommendation methods [6, 26, 43] utilize simple feature information of items (e.g., categories, textual content, related images, review information) and users (e.g., locations, devices, apps, gender) to learn their respective representations for cold-start recommendation. (2) Hybrid methods [8, 18, 34] extend MF [17] (traditional and probabilistic) so that user- and item-related information can be incorporated into their respective representations. (3) Deep learning–based hybrid approaches [1, 3, 10, 20, 34, 41] employ deep neural networks to obtain feature representations from user- and item-related attribute information and further incorporate these attributes into a collaborative filtering model for cold-start recommendation. In this article, we are committed to solving the user cold-start recommendation problem.
Although these existing models have shown effectiveness in the user cold-start recommendation task, most of them rely heavily on user attribute information and usually ignore the rich, heterogeneous relationship information among new users (new users and their corresponding attributes) and existing historical information (existing users, items, and their related attribute information), such as interaction relationships (e.g., \(u_1 - v_2\), \(u_1 - v_3\), \(u_2 - v_3\), \(u_2 - v_1\)), co-occurrence relationships (e.g., \(v_4 - a_1\), \(u_3 - a_2\), \(v_4 - a_2\), \(u_3 - a_4\), \(u_3 - a_3\)), and inclusion relationships (e.g., \(a_5 - a_8\), \(a_6 - a_9\), \(a_6 - a_8\), \(a_7 - a_9\), \(a_7 - a_8\)), as shown in Figure 1(a).
Fig. 1. (a) Real-world heterogeneous relationships among users and items. (b) A related heterogeneous graph with new users.
With the rapid development of deep learning in various fields [2, 13, 22, 28, 30, 37, 47], Graph Neural Networks (GNNs) [9, 16, 38] have also received increasing attention due to their impressive ability to model data comprising components and their dependencies, and they have yielded extraordinary improvements for recommendation systems [25, 36, 45]. This motivates us to leverage the capacity of GNN models to exploit useful relationship information between users and existing historical information and thereby obtain a superior recommendation model. However, most existing GNN-based recommendation methods aim to explicitly encode the crucial collaborative signal of user-item interactions to enhance user-item representations through a propagation process over user-item bipartite graphs. For example, STAR-GCN [48] leverages a stacked and reconstructed GCN encoder-decoder on the user-item bipartite interaction graph to obtain representations of users and items for cold-start recommendation. These methods take into account only side information or attributes of users and items as initial features, ignoring the rich, heterogeneous relationship information among users, items, and their corresponding attributes. In fact, such relationship information reflects hidden interdependencies among data and can be used to obtain more relevant information for users. Therefore, we face Challenge 1: How to effectively model rich, heterogeneous, and hidden relationship information, and how to further enrich user attributes based on it.
Another drawback of most existing user cold-start recommendation models is that they infer the representations of new users from their related content information but usually do not take into account the heterogeneity and different impacts of this information. For instance, in Figure 1(b), each new user is associated with heterogeneous attribute information (e.g., location, mobile phone type, installed apps). Among these attributes, the installed apps should have more influence on the embedding of each new user, since app usage is more representative of new users than location or mobile phone type.
There are some heterogeneous GNN-based recommendation methods that leverage meta-paths to consider the heterogeneity of graphs. For example, HeRec [31] is a heterogeneous graph representation learning–based recommendation method that extracts different kinds of representation information according to different predesigned meta-paths in heterogeneous graphs and further combines these representations with extended matrix factorization models to improve personalized recommendation performance. However, meta-path–based models depend heavily on the selection of meta-paths and cannot exploit the influence of different types of nodes on the current node when aggregating the content of neighboring nodes. Recently, heterogeneous GNN-based recommendation methods that rely on heterogeneous aggregations over heterogeneous neighboring nodes have emerged. For instance, the Heterogeneous Graph Neural Network (HetGNN) [46], which mainly consists of a node type–based neighbor aggregating module and a heterogeneous node-type information combining module, learns heterogeneous node representations by incorporating their heterogeneous content information. However, the HetGNN uses only relatively simple sampling operations to obtain neighboring nodes on user-item bipartite graphs and ignores the rich relationships among multimodal attribute information and the impact of these relationships on representation learning. Therefore, we need to address Challenge 2: How to account for the heterogeneity of nodes and the different impacts of multimodal attributes, together with the rich relationships among these attributes, when generating high-quality representations.
Moreover, conventional GNN-based recommendation methods produce the relational matrix directly from existing static relationship knowledge, which may not precisely match our objective and can cause undesirable model performance. For instance, the graph consisting of users and their related attributes is predefined and prebuilt based on co-occurrence relationships, which is not suitable for new user representation learning in user cold-start recommendation systems because new users' co-occurrence relationships are unseen. Thus, Challenge 3 is the following: How to handle unseen, non-static relationships associated with new users to obtain comprehensive and high-quality representations for them.
To handle these challenges, we present a novel user cold-start recommendation model: the Inductive Heterogeneous Graph Neural Network (IHGNN). For Challenge 1, we first create a modality-aware heterogeneous graph (M-HG) to model the hidden heterogeneous relationships associated with users and items. Then, we design a sampling strategy based on random walk operations to sample heterogeneous neighbors for each node, which can be regarded as the node's enriched information. For each node, multiple samples are taken to generate multiple sets of sampled neighbors, since a single sample may miss important nodes. For Challenge 2, we design a new kind of hierarchical attention aggregation network that comprises two distinct levels of modules, an intra-type self-attention aggregating module and an inter-type attention aggregating module, to aggregate node features of the sampled heterogeneous neighboring nodes. For each set of sampled neighbors, we first group the nodes according to their node types. Then, we apply an intra-type self-attention aggregating module to each neighbor group, which learns meaningful attention weights among homogeneous nodes to aggregate their feature information. Based on the generated representations of all neighbor groups, we further use an inter-type attention aggregating module that learns attention weights across groups to generate a useful vector representation for each set of sampled neighbors. Finally, we fuse all set representations to produce the final representation of each node. For Challenge 3, we infer new users' representations based on the inductive capacity of our proposed model. For each unseen new user, our model first builds connections to the constructed heterogeneous graph based on the user's sparse attributes. Then, we apply the multiple sampling strategy to each new user to generate its corresponding multiple sets of sampled neighbors. Finally, we generate new users' representations with the well-trained hierarchical attention aggregation network. In addition, our model predicts new users' preferences by calculating matching scores between users and items from their learned embeddings.
In summary, our contributions are as follows:
We propose a novel Inductive Heterogeneous Graph Neural Network (IHGNN) for user cold-start recommendation. IHGNN uses a heterogeneous graph that can take rich and heterogeneous relationship information among users, items, and their corresponding attributes into consideration. Further, a sampling strategy based on random walk operations is proposed to sample correlated heterogeneous neighbors to produce better representations of new users.
The IHGNN utilizes a multiple hierarchical attention mechanism on the sampled related neighbors and considers the impacts of homogeneous and heterogeneous multimodal attributes and of the various neighboring node groups in order to obtain node embeddings, including those of new users.
We report experimental results on three real datasets (the Kwai dataset, the Tiktok dataset, and the MovieLens dataset). Compared with state-of-the-art models, our proposed model performs better on the user cold-start recommendation task.

2 Related Work

This work relates to cold-start recommendation tasks, graph neural networks, and heterogeneous graph neural networks, which we briefly review below.

2.1 Cold-Start Recommendation Tasks

Although collaborative filtering (CF) [7, 17, 43, 44] has achieved impressive performance in recommendation frameworks, it often has trouble handling new users or new items with sparse historical interaction information, a difficulty commonly known as the cold-start recommendation problem. In such cases, the only way to generate a personalized recommendation is to incorporate additional attribute information. Existing cold-start methods can be coarsely classified into content-based recommendation methods [6, 8, 26], MF-based hybrid methods [8, 18, 34], and deep learning–based hybrid approaches [1, 3, 10, 20, 34, 41]. For example, SVDFeature [6], a feature-based collaborative filtering model, handles MF with pretrained features. The capacity to leverage pretrained features permits the construction of factorization frameworks that incorporate side information such as neighbor relationships, temporal dynamics, and hierarchical information, which can effectively improve cold-start models' performance. A hybrid model [34] is designed to combine the leading strengths of autoencoder frameworks in order to standardize the use of autoencoders in CF-based recommendation systems. It presents a training process suited to autoencoder models for incomplete data and further integrates both ratings and side data into a single autoencoder framework to handle the cold-start problem. Neural Collaborative Filtering (NCF) [10] is a well-known collaborative filtering model that effectively captures the critical signal in collaborative filtering data (interactions between users and items); it adopts matrix factorization and applies an inner product operation to the learned representations of users and items. Existing neural matrix factorization methods can easily utilize content information of users and items as pretrained vectors to address the cold-start problem. These algorithms show promising performance in the area of cold-start recommendation. However, most of them depend heavily on new user attribute information and usually ignore the sparsity of new user attributes.

2.2 Graph Neural Networks

The main idea of GNNs is to employ neural networks to aggregate content information for nodes from their local neighbors. For example, graph convolutional networks [16], GraphSAGE [9], and graph attention networks [38] leverage convolutional operations, long short-term memory (LSTM) aggregators, and self-attention modules, respectively, to gather neighboring information, and they have been applied in many fields [12, 13, 27, 29]. Inspired by the advantage of GNN techniques in simulating the information diffusion process, recent efforts have studied GNN-based recommendation methods that mostly apply a GNN directly to the original user-item bipartite graph to learn more expressive representations of users and items [45]. For example, Multi-Graph Convolution Collaborative Filtering (Multi-GCCF) [35] explicitly incorporates multiple graphs in the embedding learning process. Multi-GCCF not only expressively models high-order information via a bipartite user-item interaction graph but also integrates short-range information by building user-user and item-item graphs, adding edges between two-hop neighbors of the original graph. In this way, proximity information among users and items can be explicitly incorporated into user-item interactions. Multi-Component graph convolutional Collaborative Filtering (MCCF) [40] is designed to distinguish the latent purchasing motivations underneath the observed explicit user-item interactions. MCCF uses a decomposer module to decompose the edges in a user-item graph to identify the latent components that may cause a purchasing relationship and further recombines these latent components automatically to obtain unified embeddings for prediction. Disentangled graph collaborative filtering (DGCF) [14] models the importance of diverse user-item relationships in collaborative filtering for better interpretability and considers user-item relationships at the finer granularity of user intents to generate disentangled representations. Dual channel hypergraph collaborative filtering (DHCF) [39] leverages a divide-and-conquer strategy with CF to integrate users and items for recommendation while still maintaining their specific properties and further employs a hypergraph structure to model users and items with explicit hybrid high-order correlations. The hierarchical bipartite Graph Neural Network (HiGNN) [21] alternately stacks multiple GNN modules and a deterministic clustering algorithm to effectively and efficiently exploit high-order connections and non-linear interactions through hierarchical representation learning on bipartite graphs for predicting user preferences at a larger scale. Sampling strategies have also been proposed to make GNNs efficient and scalable for large-scale graph-based recommendation tasks. For instance, PinSage [45] incorporates graph structure information and node content information (e.g., visual content, textual content) and uses a novel training method that relies on harder training samples to obtain useful representations of users and items for higher-quality recommendations at Pinterest.
Moreover, GNNs can be applied to learn more expressive representations for users and items in cold-start recommendation [25, 36, 42, 45, 48]. For example, RMGCNN [25] combines a multi-graph convolutional network that captures stationary patterns of users and items with an RNN module that applies a learnable, non-linear diffusion process to generate the known ratings. In detail, RMGCNN extracts local statistical structure patterns for users and items, including cold-start users and cold-start items, in terms of their high-dimensional feature spaces and further applies the learned embeddings to predict interaction ratings. GCMC [36] is an autoencoder framework based on user-item bipartite graphs for the matrix completion problem. It generates embeddings for users and items through a message-passing operation on the user-item bipartite interaction graph; these representations are then used to reproduce the links through a bilinear operation. Ying et al. [45] developed the highly scalable GCN framework PinSage, described above, which combines random walk operations and multiple graph convolution modules to generate node representations. Chen et al. [5] propose a general bipartite embedding method called Folded Bipartite Network Embedding (FBNE) for social recommendation, which explores higher-order relationships among users and items by folding the user-item bipartite graphs, together with a sequence-based self-attention module that learns node representations from node sequences sampled from graphs. FBNE leverages implicit social relations in social graphs and higher-order implicit relations to enhance user-item representations and boost the performance of current social recommendation methods, including cold-start recommendation. However, FBNE may ignore the rich and heterogeneous relationship information among users, items, and their corresponding attributes shown in Figure 1; in addition, it does not take the heterogeneity of nodes and the different impacts of multimodal attributes into consideration when learning representations. STAR-GCN [48] leverages a stacked and reconstructed GCN encoder-decoder on the user-item bipartite interaction graph with intermediate supervision to achieve better prediction performance. STAR-GCN masks a portion of, or all of, the user and item representations and reconstructs these masked representations with a graph encoder-decoder block during the training stage, which makes the learned embeddings more expressive and generalizes the method to obtain useful representations of unseen nodes for cold-start recommendation. However, these GNN models are designed to embed homogeneous or bipartite graphs and therefore cannot exploit the rich relationships connecting different types of heterogeneous data. Thus, they may not be applicable to the actual cold-start scenario.

2.3 Heterogeneous Graph Neural Networks

Heterogeneous graphs [30] can model multiple complex object types and effectively capture the plentiful correlations between them. Many existing heterogeneous graph-based recommendation models combine various hand-designed semantic meta-paths with an MF framework for recommendation tasks [11, 32, 33, 49]. For example, Shi et al. [33] designed a semantic meta-path-based recommendation approach, termed SemRec, to flexibly calculate the ratings between users and items for personalized recommendation. SemRec introduces the weighted meta-path concept, which subtly portrays various meta-path semantics by distinguishing diverse properties. In addition, based on well predesigned meta-paths, SemRec calculates prioritized and personalized attention values representing user preferences on various meta-paths while flexibly incorporating heterogeneous information. HeRec [31] is a heterogeneous graph representation learning-based recommendation method that effectively extracts different kinds of representation information according to different predesigned meta-paths in heterogeneous graphs. It further combines these representations with extended MF models to improve personalized recommendation performance. Although these approaches are designed to embed heterogeneous graphs, they rely heavily on the meta-path design process and may not effectively capture high-order structural information for cold-start recommendation tasks. Recently, the HetGNN [46], which mainly consists of a node type–based neighbor aggregating module and a heterogeneous node-type information combining module, was proposed for learning heterogeneous node representations by incorporating their heterogeneous content information. Although heterogeneous node-type information in various heterogeneous graphs can be effectively used by these two modules, we argue that the HetGNN may not effectively incorporate the heterogeneous contents of nodes due to the lack of plentiful relationship information between the multimodal contents of nodes. HHFAN [4] is also a heterogeneous representation learning-based model that utilizes heterogeneous relationships to enhance representations of users and items for micro-video recommendation, but it is not suitable for cold-start recommendation tasks. To exploit the highly complex relationships between users (including cold-start users), items, and their associated multimodal attributes, the IHGNN leverages a modality-aware heterogeneous graph to learn expressive representations for user cold-start recommendation tasks. Furthermore, a novel hierarchical feature aggregation network, which mainly comprises intra- and inter-type feature aggregating modules, is designed to incorporate the intricate graph structure information and abundant node content semantics contained in M-HGs to obtain expressive node representations, including those of cold-start nodes.

3 The Proposed Algorithm

3.1 Problem Statement

In this article, we focus on the user cold-start recommendation task, in which all items are denoted as \(Item = \lbrace item_1, item_2, \ldots , item_{|Item|}\rbrace\) and all new users are denoted as \(User = \lbrace user_1, user_2, \ldots , user_{|User|}\rbrace\). Each new user is associated with sparse attributes denoted as \(User_{attrs}\) (e.g., locations, phone information, apps installed on their phones): \(User_{attrs} = \lbrace user^a_1, user^a_2, \ldots , user^a_{|User_{attrs}|}\rbrace\). Similarly, the set of multimodal attributes of each item is defined as \(Item_{attrs}\) (e.g., locations, related tags, visual content, audio content, textual descriptions): \(Item_{attrs} = \lbrace item^a_1, item^a_2, \ldots , item^a_{|Item_{attrs}|}\rbrace\). Since many attributes of users (e.g., apps) and items are related to tags, we adopt predesigned tag trees to describe the associations among various tags. We define the set of predesigned tag trees as \(T = \lbrace t^1, t^2, \ldots , t^{|T|}\rbrace\), with a single tag tree denoted \(t^i, 1 \le i \le |T|\). We build a user-item bipartite interaction matrix \(I \in {R^{|User| \times |Item|}}\), in which the entry \(i_{uv}\) is derived from users' implicit feedback data: \(i_{uv} = 1\) indicates that an interaction exists between new user u and item v, and \(i_{uv} = 0\) indicates that no interaction exists between them. Note that only new users in the training set have interaction data, whereas new users in the testing set have none. In the training phase, we strive to acquire high-quality representations for new users and items from their attribute information and the correlations between them, which can preserve their historical interaction data effectively. In the testing phase, we focus on inferring representations of new users in the test set and further computing rating scores for preference estimation to predict these new users' preferences. Table 1 lists the key notations of the IHGNN model; a small construction example follows the table.
Table 1.
Notation | Description
Item | the item set
User | the user set
\(User_{attrs}\) | the attribute set of users
\(Item_{attrs}\) | the attribute set of items
T | the tag trees
\(t^i\) | the i-th tag tree
I | the user-item interaction matrix
\(i_{uv}\) | the implicit feedback value of user u and item v
\(G=\left(V, R, TE_{V}, TE_{E} \right)\) | a modality-aware heterogeneous graph
V | the node set
R | the edge set
\(R_{tt}\) | the inclusion relationships between tags
\(TE_{V}\) | the set of node types
\(TE_{E}\) | the set of edge types
X | the feature matrix
\(x_v\) | the feature of node v
\(nei_v\) | a related node of node v
\(SN_{v}\) | the sampled neighbor set of node v
\(SN_{vt}\) | the t-type sampled neighbor set of node v
\(Aggregator^{t}\) | the t-type neighbor aggregator function
\(\alpha ^{v,i}\) | the attention weight of the i-type sampled neighbor set of node v
\(\mathcal {E}_v\) | the combined embedding of the i-th set of sampled neighbors of node v
\(\mathcal {UE}_v\) | the final representation of node v
\(y_{uv}\) | the predicted preference probability of user u and item v
\(\lambda\) | the regularization weight
\(\theta\) | the parameters of the model
d | the aggregated embedding dimension
K | the ranking position
k | the top k results
Table 1. The Main Notations of Our Proposed Model
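To make the setup concrete, the following minimal sketch (with hypothetical toy identifiers, not data from the paper) shows how the implicit-feedback matrix I could be assembled from logged interactions:

import numpy as np

# Hypothetical toy universes and logged interactions.
users = ["user_1", "user_2", "user_3"]
items = ["item_1", "item_2", "item_3", "item_4"]
interactions = [("user_1", "item_2"), ("user_1", "item_3"),
                ("user_2", "item_1"), ("user_2", "item_3")]

u_idx = {u: i for i, u in enumerate(users)}
v_idx = {v: j for j, v in enumerate(items)}

# I[u, v] = 1 iff an interaction between user u and item v was observed.
I = np.zeros((len(users), len(items)), dtype=np.int8)
for u, v in interactions:
    I[u_idx[u], v_idx[v]] = 1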

3.2 Overall Framework

In order to handle the challenges in Section 1, we propose a well-designed framework, the Inductive Heterogeneous Graph Neural Network (IHGNN), for user cold-start recommendation. Our framework, as demonstrated in Figure 2, comprises four key components:
Fig. 2. Inductive Heterogeneous Graph Neural Network (IHGNN) architecture.
Heterogeneous Graph Construction: A modality-aware heterogeneous graph (M-HG) is built to model users (including new users), items, and associated attribute data based on abundant relationship information, such as the relational data among users, items, and their attributes; the user-item bipartite graph; and the tag trees.
Multiple Hierarchical Attention Aggregation Networks: For each node in the constructed heterogeneous graph, we design a sampling strategy based on random walk operations and apply it to sample a fixed-size set of associated heterogeneous neighboring nodes. Multiple samples are then taken to create multiple groups of sampled neighbors so as to capture more critical neighbors. To gather the node features of all sets of sampled heterogeneous neighbors for each node, we propose an innovative hierarchical attention aggregation network that processes each set of sampled neighbors in three steps. (1) Grouping operation: We group the sampled neighbors in a set according to their node types. (2) Intra-type feature aggregating module: For each node group, we leverage an intra-type self-attention aggregating module to aggregate the content features of all neighboring nodes and generate the group's aggregated type-based representation. (3) Inter-type feature aggregating module: After generating the representation of each type-based neighbor group, an inter-type attention aggregating module quantifies the different influences of the heterogeneous groups and uses these learned influences to calculate the embedding of each set of sampled neighbors. Finally, we design a fusion module to merge the embeddings of all sampled neighbor sets into the final representation of each node in the heterogeneous graph.
Model Optimization: To learn representations that preserve the implicit structure information of the constructed heterogeneous graph and to generate high-quality representations for new users, we define the loss function as the KL-divergence between the hypothetical distribution and the empirical distribution, optimized through backpropagation and mini-batch Adaptive Moment Estimation (Adam) [15].
Inductive Representation Learning for User Cold-Start Recommendation: We infer new users' representations based on the inductive capacity of our proposed model. For each unseen new user, our model first builds connections to the constructed heterogeneous graph. Then, we apply the multiple sampling strategy to the new user to generate its corresponding multiple sets of sampled neighbors. Finally, we generate the new user's representation with the well-trained hierarchical attention aggregation network. After optimization, we can infer embeddings for new users in the testing set using our proposed model. An inner product operation integrates the inferred embeddings of new users and candidate items to predict the preference likelihood, which indicates the level of preference of a new user for a candidate item.

4 Methodology

This section presents our IHGNN model for user cold-start recommendation.

4.1 Heterogeneous Graph Construction

To consistently and effectively model users, items, their respective attributes, and the different associated relationships, we build an M-HG, denoted as \(G = (V, R, {TE}_{V}, TE_{E})\). In the M-HG, V represents the various kinds of nodes and R represents the various relationships between nodes, where \(V = User \cup Item \cup User_{attrs} \cup Item_{attrs} \cup T\) and \(R = I \cup R_{ua} \cup R_{ia} \cup R_{ut} \cup R_{it} \cup R_{tt}\). \(R_{ua}\) represents the connections between users and their corresponding sparse attributes, \(R_{ia}\) the connections between items (micro-videos) and their corresponding multimodal attributes, \(R_{ut}\) the connections between users and their corresponding tags, and \(R_{it}\) the connections between items and their corresponding tags. \(R_{tt}\) denotes the inclusion connections between tags within the tag trees T. \(TE_{V}\) denotes the node type set, which comprises the user, item, and attribute node types: \(TE_{V} = \lbrace type^{user}, type^{item}, type^{image}, type^{audio}, type^{text}, type^{tag}\rbrace\). \(TE_{E}\) is the edge type set, which includes the relation types of I, \(R_{ua}\), \(R_{ia}\), \(R_{ut}\), \(R_{it}\), and \(R_{tt}\): \(TE_{E} = \lbrace type^{i}, type^{ua}, type^{va}, type^{ut}, type^{vt}, type^{tt}\rbrace\).
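For illustration, an M-HG can be stored as typed node and edge sets. The following minimal sketch (class and field names are our own, not the authors' implementation) mirrors the definition of G above:

from collections import defaultdict

NODE_TYPES = {"user", "item", "image", "audio", "text", "tag"}  # TE_V

class MHG:
    """Modality-aware heterogeneous graph: typed nodes plus typed edges."""
    def __init__(self):
        self.node_type = {}           # node id -> node type
        self.adj = defaultdict(list)  # node id -> [(neighbor id, edge type)]

    def add_node(self, nid, ntype):
        assert ntype in NODE_TYPES
        self.node_type[nid] = ntype

    def add_edge(self, src, dst, etype):
        # Stored in both directions so random walks can traverse freely.
        self.adj[src].append((dst, etype))
        self.adj[dst].append((src, etype))

g = MHG()
g.add_node("u1", "user"); g.add_node("v1", "item"); g.add_node("t1", "tag")
g.add_edge("u1", "v1", "interaction")  # from the interaction matrix I
g.add_edge("v1", "t1", "item-tag")     # from R_it
g.add_edge("u1", "t1", "user-tag")     # from R_ut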

4.2 Multiple Hierarchical Attention Aggregation Networks

For each node v, \(\forall {v} \in V\), we aim to enrich its node feature representation by aggregating its related node information in the constructed graph G, which can be formulated as follows:
\begin{equation} feature(v) = \int feature(nei_v)ds, \end{equation}
(1)
where the function \(feature(\cdot)\) is interpreted as the embedding of the node, \(nei_v\) is a random variable and represents the related node of the node v, and the variable \(s(s=importance(nei_v))\) can be interpreted as the importance or impact of the node \(nei_v\). However, there are plenty of related nodes for each node v, making it computationally expensive to gather information from all of the related nodes. To deal with this problem, we use a sampling strategy, the Random Walk Sampling Strategy (RWSS), to reduce the computational cost, so that Equation (1) can be reformulated as follows:
\begin{equation} feature(v) = \frac{1}{|SN_v|}\sum _{nei_v \in SN_v}feature(nei_v), \end{equation}
(2)
where \(SN_v\) represents sampled neighbors of node v.
Specifically, the RWSS comprises two steps:
(1) Starting a random walk of fixed length from node \(v, \forall {v} \in V\). The walk iteratively travels to a neighbor of the current node or restarts from the starting node with a certain probability p. The walk runs until a fixed number of nodes has been collected, referred to as \(RWSS(v)\). Note that the number of nodes of each type in \(RWSS(v)\) is fixed to guarantee that every node type is sampled for v.
(2) Selecting the neighboring nodes of each type. For node v, we find the top \(k_t\) nodes of node type t in \(RWSS(v)\) based on their frequencies and denote these selected nodes as the set of t-type associated neighbors of node v; together they form \(SN_v = \lbrace node_1, node_2, \ldots , node_{|SN_v|}\rbrace\).
Compared with the general random walk method, the RWSS has two advantages, which are important for enriching the representation of each node, especially cold-start nodes. (1) For each node in the constructed M-HG, the RWSS ensures that each type of node, whether a first-order or higher-order neighbor, is evenly sampled by constraining the number of each node type in the sampling result. (2) For each node v, the RWSS selects the top \(k_t\) nodes of node type t from \(RWSS(v)\) according to frequency, which reduces the negative impact of noisy nodes in the constructed M-HG on model performance.
To strengthen the effect of the sampling strategy, the RWSS is conducted multiple times, and the result is denoted as \(MulSN_v = \lbrace SN^1_v, SN^2_v, \ldots , SN^{|MulSN_v|}_v\rbrace\). Note that multiple sampling operations can be time-consuming; thus, for each node, we perform the sampling in advance during the preprocessing phase of the experiments. We also analyze the relationship between the number of samplings and model performance in Figure 8 in the experimental section to find a sampling count that offers a good trade-off.
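The following minimal Python sketch of the RWSS (our own naming; it reuses the MHG structure sketched in Section 4.1) illustrates the walk-with-restart collection and the per-type top-\(k_t\) frequency selection:

import random
from collections import Counter

def rwss(g, start, k_per_type, walk_len=100, p_restart=0.5):
    # Walk with restart until a fixed number of nodes is collected, then
    # keep the top-k_t most frequent nodes of each type (k_per_type: t -> k_t).
    if not g.adj[start]:
        return {t: [] for t in k_per_type}  # isolated node: nothing to sample
    visited, cur = [], start
    while len(visited) < walk_len:
        if random.random() < p_restart:
            cur = start                         # restart from the start node
        else:
            cur, _ = random.choice(g.adj[cur])  # hop to a random neighbor
            visited.append(cur)
    freq = Counter(visited)                     # visit frequency per node
    sampled = {}
    for t, k_t in k_per_type.items():           # per-type top-k_t selection
        of_type = [n for n in freq if n != start and g.node_type[n] == t]
        sampled[t] = sorted(of_type, key=freq.get, reverse=True)[:k_t]
    return sampled

# Multiple samplings (MulSN_v): repeat the RWSS several times per node.
mul_sn = [rwss(g, "u1", k_per_type={"item": 5, "tag": 5}) for _ in range(3)]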
In an M-HG G, the feature of node \(v \in V\) is denoted as \({x_{v} \in {R^{d_{f} \times 1}}}\), where \(d_f\) is the feature dimension. Note that we can leverage CNN-based methods [23] to pretrain image nodes and the doc2vec model [19] to pretrain textual nodes; node initialization thus depends on the node type. Here, we adopt a content vector transformer module, \(\mathcal {FC}_{\theta _{x}}\), to map the embeddings of the various node types into a unified space. Formally, the transferred representation of node v is calculated as follows:
\begin{equation} f(v) = \mathcal {FC}_{\theta _{x}}(x_v), \end{equation}
(3)
where \(f(v) \in {R^{d \times 1}}\) and d is the transferred representation dimension.
To aggregate the node representations, transferred by \(\mathcal {FC}_{\theta _{x}}\), of all sampled neighboring nodes of node v, we propose an innovative hierarchical attention aggregation network, as shown in Figure 3, applied to each set of sampled neighbors in three steps: (1) a grouping module; (2) an intra-type feature-aggregating module; and (3) an inter-type feature-aggregating module.
Fig. 3. Hierarchical attention aggregation networks.

4.2.1 Grouping Module.

After using the RWSS in the previous section, multiple sampling results \(MulSN_v\) are obtained. For each set of sampled neighbors \(SN_v\), we first group them based on their node types. These groups consist of three categories: several multimodal attribute neighbor node groups, a user neighbor node group, and an item neighbor node group. These multimodal attribute neighbor node groups are further divided into four subcategories: an image neighbor node group, an audio neighbor node group, a tag neighbor node group, and a text neighbor node group. They represent related multimodal attribute information for users (including new users) and items. Here, we define the t-type sampled neighboring node group in \(SN_v\) as \(SN_{vt}\).

4.2.2 Intra-type Feature-Aggregating Module.

For t-type group \(SN_{vt}\), we use a neural network to generate node representation for \(v_{ner} \in SN_{vt}\). Formally, the aggregated t-type neighboring node group’s representation for v can be formulated as follows:
\begin{equation} f^t(v) = Aggregator^t_{v_{ner} \in SN_{vt}} \lbrace f(v_{ner})\rbrace , \end{equation}
(4)
where \(f^t(v) \in {R^{d \times 1}}\), d is the dimension of the aggregated t-type neighboring node group's representation, \(f(v_{ner})\) is the transferred representation of node \(v_{ner}\), and \(Aggregator^t\) is the t-type neighbor group's aggregator function. A self-attention technique can be applied as the \(Aggregator^t\) function to obtain attention values between the homogeneous nodes in each group. Following the Transformer [37], which is composed of a stack of multi-head self-attention layers and point-wise fully connected layers in both its encoder and decoder, we define our self-attention intra-type feature aggregation module as follows.
Given a set of input features, denoted as \(X \in {R^{N \times d}}\), self-attention transforms them into the matrices of queries \(Q \in {R^{N \times d}}\), keys \(K \in {R^{N \times d}}\) and values \(V \in {R^{N \times d}}\), given by
\begin{equation} Q = (X+PE)W_Q, K = (X+PE)W_K, V = XW_V, \end{equation}
(5)
where \(W_Q\), \(W_K\), and \(W_V \in {R^{d \times d}}\) are learnable projection weights, PE is the absolute positional embedding of the features, N is the number of rows of the input features X, and d is the dimension of the input features X. Note that PE encodes the relative position of each node, sorted by frequency, within its type-based neighboring node group. The attention weights A are calculated as follows:
\begin{equation} A =softmax\left(\frac{QK^T}{\sqrt {d}}\right), \end{equation}
(6)
where the softmax function is applied to obtain the weights of the values and \(\sqrt {d}\) is a scaling factor. The output weighted average vectors \(\hat{V}\), combined with a residual connection, are formulated as follows:
\begin{equation} \hat{V} = AV + X . \end{equation}
(7)
To improve capacity, our self-attention module can be extended to a multi-head version. The output weighted average vectors of the multi-head module, \(\hat{V}_{mul}\), are calculated as follows:
\begin{equation} \hat{V}_{mul} = Concat\lbrace \hat{V}_{head\_1}, \ldots ,\hat{V}_{head\_i}, \ldots \rbrace , \end{equation}
(8)
where Concat is a feature concatenation function and \(\hat{V}_{head\_i}\) is the output of the i-th self-attention head. Moreover, as is well known, the atom operation of the self-attention mechanism, the canonical dot-product, causes the time complexity and memory usage per layer to be \(O(N^2)\); a stack of J encoder layers uses \(O(J \cdot N^2)\) total memory, which limits model scalability on long and large inputs. Thus, to reduce the space-time complexity and improve the efficiency of the self-attention aggregation module, we randomly sample \(\ln N\) keys to calculate the attention weights. Controlled by a constant sampling factor c, we set the number of sampled keys to \(c \cdot \ln N\) for each query, so that our self-attention aggregation module needs to compute only \(O(\ln N)\) dot-products for each query-key lookup and the per-layer memory usage stays \(O(N \ln N)\).
Then, the output vectors \(\hat{V}_{mul}\) of our self-attention module are fed to a position-wise feed-forward network (FFN) with a single non-linearity, applied independently to each element of the set:
\begin{equation} FFN(\hat{V}_{mul}) = W_1 \sigma (W_2 \hat{V}_{mul} + b) + c , \end{equation}
(9)
where \(\sigma (x)\) is the ReLU activation function, \(W_1\) and \(W_2\) are learnable weights, and b and c are bias terms.
Finally, for ease of discussion, we denote this process with a layer normalization function Norm as SAtt:
\begin{equation} SAtt(X) = Norm(FFN(\hat{V}_{mul})) . \end{equation}
(10)
Based on this self-attention module, we reformulate \(f^t(v)\) as follows:
\begin{equation} f^t(v) = \frac{\sum _{v_{ner} \in SN_{vt}}SAtt\lbrace f(v_{ner})\rbrace }{|SN_{vt}|} , \end{equation}
(11)
where we leverage the self-attention module to aggregate the transferred node representations of the t-type neighbors and then average them to generate the representation of the aggregated t-type neighboring node group.
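A compact PyTorch sketch of this intra-type aggregator follows. It is a simplification under assumed names, not the authors' code: it keeps the frequency-rank positional embedding of Equation (5), the residual connection of Equation (7), the FFN and normalization of Equations (9) and (10), and the mean pooling of Equation (11), but omits the \(\ln N\) key-sampling trick for brevity:

import torch
import torch.nn as nn

class IntraTypeAggregator(nn.Module):
    # Intra-type module: positional embedding by frequency rank, multi-head
    # self-attention with a residual connection, FFN + LayerNorm, mean-pool.
    def __init__(self, d, n_heads=4, max_group=64):
        super().__init__()
        self.pos = nn.Embedding(max_group, d)  # PE over frequency ranks
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.norm = nn.LayerNorm(d)

    def forward(self, x):                  # x: (group_size, d), one type group
        x = x.unsqueeze(0)                 # add batch dim -> (1, n, d)
        n = x.size(1)
        pe = self.pos(torch.arange(n)).unsqueeze(0)
        q = x + pe                         # queries/keys carry PE, values do not
        v_hat, _ = self.attn(q, q, x)      # Eqs. (5)-(6)
        v_hat = v_hat + x                  # residual connection, Eq. (7)
        out = self.norm(self.ffn(v_hat))   # Eqs. (9)-(10)
        return out.mean(dim=1).squeeze(0)  # average -> f^t(v), Eq. (11)

agg = IntraTypeAggregator(d=200)
f_t_v = agg(torch.randn(10, 200))  # a group of 10 same-type neighbors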

4.2.3 Inter-type Feature Aggregating Module.

After that step, \(|TE_{V}|\) aggregated representations are obtained for \(SN_{v}\) of node v, defined as \(\mathcal {E}_v^{TE_{V}} \in {R^{|TE_V| \times d}}\). There are \(|TE_{V}|\) node types in the M-HG G, and different types of neighboring nodes may contribute differently to the generation of node representations. To fuse these aggregated neighbor representations into the representation of \(SN_{v}\) of node v while accounting for their influences on node v, we employ an attention mechanism, which can be formulated as follows:
\begin{equation} \mathcal {E}_v = \sum _{i} \alpha ^{v,i} \mathcal {E}_{v,i}^{TE_V} \end{equation}
(12)
\begin{equation} \alpha ^{v,i} = \frac{\exp \left(\mathrm{LeakyReLU}\left(u^T\left[v\bigoplus \mathcal {E}_{v,i}^{TE_V}\right]\right)\right)}{\sum _{j} \exp \left(\mathrm{LeakyReLU}\left(u^T\left[v\bigoplus \mathcal {E}_{v,j}^{TE_V}\right]\right)\right)} , \end{equation}
(13)
where \(\mathcal {E}_v \in {R^{d \times 1}}\) is the combined representation of \(SN_{v}\) of node v, \(\bigoplus\) denotes the concatenation operation, \(\alpha ^{v,i}\) denotes the importance of the i-type neighboring group's embedding, and \(u \in {R^{2d \times 1}}\) is the trainable attention parameter.
After the hierarchical attention aggregation operation for each set of sampled neighbors \(SN_v\) of node v, we obtain \(|MulSN_v|\) node embeddings for \(MulSN_v\), denoted as \(MulE_v = \lbrace \mathcal {E}^1_v, \mathcal {E}^2_v, \ldots , \mathcal {E}^{|MulE_v|}_v\rbrace\). To fuse these representations into the ultimate representation \(\mathcal {UE}_v\) for node v, we design a fusion module formulated as follows:
\begin{equation} \mathcal {UE}_v = MLP\big \lbrace Concat\big \lbrace \mathcal {E}^1_v; \mathcal {E}^2_v; \ldots ; \mathcal {E}^{|MulE_v|}_v\big \rbrace \big \rbrace , \end{equation}
(14)
where \(\mathcal {UE}_v \in {R^{d \times 1}}\), the function \(MLP()\) is a fully connected layer, and the function \(Concat()\) concatenates all of the representations in \(MulE_v\).
Note that we apply the same dimension d to the transferred node representation, the aggregated t-type neighboring node group's representation, and the concatenated ultimate representation of node v to make tuning the IHGNN easier in this article.
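Continuing the PyTorch sketch, the inter-type attention of Equations (12) and (13) and the fusion of Equation (14) could look as follows (names are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class InterTypeAggregator(nn.Module):
    # Attention over per-type group embeddings (Eqs. 12-13), then MLP fusion
    # over the |MulSN_v| sampled-set embeddings (Eq. 14).
    def __init__(self, d, n_samplings):
        super().__init__()
        self.u = nn.Linear(2 * d, 1, bias=False)   # attention vector u
        self.fuse = nn.Linear(n_samplings * d, d)  # fusion MLP, Eq. (14)

    def combine_types(self, v_emb, type_embs):
        # type_embs: (|TE_V|, d) group embeddings; v_emb: (d,) for node v.
        pairs = torch.cat([v_emb.expand_as(type_embs), type_embs], dim=-1)
        alpha = torch.softmax(F.leaky_relu(self.u(pairs)), dim=0)  # Eq. (13)
        return (alpha * type_embs).sum(dim=0)       # E_v, Eq. (12)

    def forward(self, v_emb, per_sampling_type_embs):
        # per_sampling_type_embs: list of (|TE_V|, d), one entry per SN^i_v.
        es = [self.combine_types(v_emb, te) for te in per_sampling_type_embs]
        return self.fuse(torch.cat(es, dim=-1))     # UE_v, Eq. (14)

inter = InterTypeAggregator(d=200, n_samplings=3)
ue_v = inter(torch.randn(200), [torch.randn(6, 200) for _ in range(3)])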

4.3 Model Optimization

To learn a representation \(\mathcal {UE}_v \in {R^{d \times 1}}\) of each node v that preserves the implicit structure information of the M-HG, the loss \(\mathcal {L}\) is defined as the KL-divergence between the hypothetical distribution \(p(v_j|v_i)\) and the empirical distribution \(\hat{p}(v_j|v_i)\) over \(\mathcal {N}(v_i)\), the set of direct neighbors of node \(v_i\):
\begin{equation} p(v_j|v_i) = \frac{\exp \left(\mathcal {UE}_{v_j}^T\mathcal {UE}_{v_i}\right)}{\sum _{v_k \in \mathcal {N}(v_i)} \exp \left(\mathcal {UE}_{v_k}^T\mathcal {UE}_{v_i}\right)} \end{equation}
(15)
\begin{equation} \mathcal {L} = KL(p(\cdot |\cdot),\hat{p}(\cdot |\cdot)) , \end{equation}
(16)
where \(KL(\cdot ,\cdot)\) denotes the KL-divergence. The empirical probability \(\hat{p}(v_j|v_i)\) is set to 1 if \(v_j \in \mathcal {N}(v_i)\) and 0 otherwise. Expanding \(KL(\cdot ,\cdot)\) and dropping constant terms, the loss \(\mathcal {L}\) can be rewritten as follows:
\begin{equation} \mathcal {L} = -\sum _{v_j \in \mathcal {N}(v_i)} \log p(v_j | v_i) . \end{equation}
(17)
Optimizing the loss \(\mathcal {L}\) demands a full scan of the neighbors \(\mathcal {N}(v_i)\) of each node, which incurs significant computational cost. Thus, we leverage negative sampling [24] and transform the loss \(\mathcal {L}\) as follows:
\begin{equation} \mathcal {L} = \sum _{(i, j, k) \in NS} -\ln {\sigma \big ({\mathcal {UE}_{vj}}^{T} \mathcal {UE}_{vi} - {\mathcal {UE}_{vk}}^{T} \mathcal {UE}_{vi}\big)} + \lambda ||\theta ||_2^2 , \end{equation}
(18)
where \(\sigma (x)\) is the sigmoid function; NS is a set of sampled node triples; and i, j, and k index the nodes \(v_i\), \(v_j\), and \(v_k\), respectively. Each sampled triple satisfies \(v_j \in \mathcal {N}(v_i)\) and \(v_k \notin \mathcal {N}(v_i)\). \(\theta\) and \(\lambda\) denote the parameters of our IHGNN and the regularization weight, respectively. The loss \(\mathcal {L}\) can be optimized by backpropagation with the Adam optimizer [15].
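A minimal PyTorch sketch of this objective (our own naming) is given below; each row of triples holds the indices (i, j, k) of one sampled triple:

import torch

def ihgnn_loss(ue, triples, lam, params):
    # Eq. (18): BPR-style loss over sampled (i, j, k) node triples, where
    # v_j is a neighbor of v_i and v_k is not, plus L2 regularization.
    i, j, k = triples.T                        # triples: LongTensor (|NS|, 3)
    pos = (ue[j] * ue[i]).sum(dim=-1)          # UE_vj^T UE_vi
    neg = (ue[k] * ue[i]).sum(dim=-1)          # UE_vk^T UE_vi
    loss = -torch.log(torch.sigmoid(pos - neg)).sum()
    reg = sum((p ** 2).sum() for p in params)  # ||theta||_2^2
    return loss + lam * reg

# Typical mini-batch Adam step:
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# opt.zero_grad()
# ihgnn_loss(ue, batch, 1e-4, list(model.parameters())).backward()
# opt.step()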

4.4 Inductive Representation Learning for User Cold-Start Recommendation

After optimization, we obtain the learned representation \(\mathcal {UE}_v \in {R^{d \times 1}}\) of each node v in the M-HG G. Given a new user u in the testing set with sparse attributes, we can infer the representation of u from the learned framework and the learned node representations in the M-HG G in three steps, as shown in Figure 4:
Fig. 4. Inductive representation learning for new users.
First, given a new user u with corresponding sparse attribute information (e.g., phone information, locations, phone apps), we treat u as a graph node and connect it to the heterogeneous graph M-HG G used in the training process based on the user's sparse attributes. This operation enables the inductive new-user representation learning of our IHGNN model.
Second, we regard new user u as a target node and infer the ultimate embedding of u with the well-trained multiple hierarchical attention aggregation networks, including the RWSS sampling operation, the multiple hierarchical attention aggregations, and the fusion operation.
Finally, we calculate a preference score \(y_{uv}\), which indicates how much user u prefers the candidate item v, from their final learned representations. Formally, the preference score is defined as follows:
\begin{equation} y_{uv} = \sigma ({\mathcal {UE}_u}^{T} \mathcal {UE}_v) , \end{equation}
(19)
where \(\sigma (x) = \frac{\exp (x)}{1 + \exp (x)}\) is the sigmoid function, \(\mathcal {UE}_u\) and \(\mathcal {UE}_v\) are the learned final representations of user u and candidate items v, respectively, and \(y_{uv}\) is the predicted preference score value for user u and candidate item v.
Although new users in the testing set have no historical interaction data and do not appear in the training data, their attribute data (e.g., phone information, locations, phone apps) are shared with existing users, so we can leverage these shared attributes to infer representations of new users.
Note that the representations of these shared attributes are learned during training, and we initialize the representation of each new user by summing the learned representations of the user's sparse attributes.
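The inference step can be summarized in a short sketch (hypothetical helper names; connecting the user to the M-HG and running the trained aggregators reuse the modules sketched in Section 4.2):

import torch

def init_new_user(attr_ids, attr_emb):
    # New users are initialized as the sum of the learned embeddings of
    # their sparse attributes (attr_emb: table learned during training).
    return attr_emb[attr_ids].sum(dim=0)

def preference_score(ue_u, ue_v):
    # Eq. (19): sigmoid of the inner product of the final representations.
    return torch.sigmoid(ue_u @ ue_v)

x_u = init_new_user(torch.tensor([2, 17, 41]), torch.randn(100, 200))
# x_u is then connected to the M-HG, sampled with the RWSS, and passed
# through the trained aggregation networks to obtain UE_u before scoring.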

4.5 Algorithm Description

Our algorithm is presented in Algorithm 1. Given the heterogeneous graph M-HG G, the training user-item pairs P, and the batch size B, our goal is to learn a multiple hierarchical attention aggregation network that leverages the shared attributes of new users to infer their representations for predicting their preferences for micro-videos. The core of Algorithm 1 is to train a high-quality multiple hierarchical attention aggregation network through backpropagation and mini-batch Adaptive Moment Estimation (Adam).

5 Experimental Results

5.1 Datasets

We use Kwai, Tiktok, and MovieLens datasets for evaluation. Table 2 contains their statistics and we briefly describe them as follows:
Table 2.
Dataset | Users | Micro-videos | Interactions | Density
Tiktok | 3,656 | 7,085 | 1,253,112 | 4.49%
Kwai | 169,878 | 310,681 | 775,834,643 | 1.47%
MovieLens | 6,040 | 3,706 | 1,000,209 | 4.47%
Table 2. The Statistics of the Three Real-World Datasets
Tiktok dataset: This dataset is published by Tiktok, a popular micro-video platform. It contains micro-videos created by users registered on the platform and user-video interactions (e.g., click, like). We use the micro-video features extracted from the multimodal data in the original dataset rather than the raw data.
Kwai dataset: This dataset is extracted from Kwai, a real-world micro-video sharing platform. It contains users and micro-videos with their associated attributes, as well as relationship information, including user-video interactions.
MovieLens (MLs) dataset: MovieLens is a movie rating dataset that has been extensively applied to CF recommendation algorithms. We use the one-million-rating version, which removes users with fewer than 20 rating records. To obtain implicit feedback, we build a binary vector for each user, where each entry indicates whether the user has rated the corresponding movie.

5.2 Baselines

To evaluate the performance of the IHGNN, we consider several state-of-the-art approaches as baselines, including traditional methods and graph-based methods. Note that, for all baselines, we conduct experiments in the user cold-start scenario.
FM\(_{HIG}\): FM\(_{HIG}\) combines factorization machine (FM)–based frameworks with side content features (e.g., locations) beyond the user and item for recommendation tasks. In this work, we feed heterogeneous information as side features into FM models for user cold-start recommendation.
Neural Collaborative Filtering (NCF): NCF [10] fuses MF methods and neural networks to model and predict user-item bipartite interaction information for recommendation tasks.
GraphSAGE: GraphSAGE [9] is an unsupervised inductive graph representation learning framework on large graphs. GraphSAGE can be utilized to obtain expressive low-dimensional representations for graph nodes, including nodes unseen in the training stage due to its inductive learning capacity. It is especially useful for considering graph structure information and rich node attribute information from neighboring nodes.
STAR-GCN: STAR-GCN [48] designs a stacked and reconstructed GCN framework on user-item bipartite interaction graphs. STAR-GCN requires obtaining several rated edges connected with new nodes in the testing graph and further leverages these edges to make predictions, which may be suitable for cold-start problems.
HeRec: HeRec [31] is a heterogeneous graph representation learning–based recommendation method that can effectively extract different kinds of representation information in terms of different predesigned meta-paths in heterogeneous graphs, and further combines these representations with extended MF models for improving personalized recommendation performances.
HetGNN: The HetGNN [46] is a graph representation learning method for learning heterogeneous node representations by incorporating their heterogeneous content information. The HetGNN mainly consists of the node type–based neighboring aggregating module and the heterogeneous node type information–combining module to consider the heterogeneity of graphs.
IHGNN: The IHGNN is our proposed recommendation model, which can leverage M-HGs for preserving the rich and heterogeneous relationships among users, items, and their relevant attribute information. Furthermore, IHGNN utilizes a well-designed hierarchical attentive aggregation module to learn the representations of nodes, including new users, to consider the heterogeneity of M-HGs for user cold-start recommendation tasks.

5.3 Experimental Settings

For each dataset, we randomly select 80% and 60% of users for training, and the remaining users are treated as the testing set. To evaluate our approach and the compared baselines on user cold-start recommendation, we employ four widely used evaluation metrics [42]: Normalized Discounted Cumulative Gain at top k (NDCG@k), Recall at top k (R@k), Precision at top k (P@k), and Area under the ROC Curve (AUC). In practice, following the experimental settings of the recommendation model NCF [10] and the micro-video recommendation model MMGCN [42], both of which are popular recommendation methods, we set k = 10 and report the average scores over the testing set; a toy computation of the ranking metrics is sketched below.
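For reference, the ranking metrics can be computed per user as in the following sketch (our own helper, not the authors' evaluation code):

import numpy as np

def metrics_at_k(ranked_items, relevant, k=10):
    # P@k, R@k, and NDCG@k for a single test user, using the standard
    # definitions with binary relevance.
    topk = ranked_items[:k]
    hits = [1.0 if item in relevant else 0.0 for item in topk]
    p_at_k = sum(hits) / k
    r_at_k = sum(hits) / max(len(relevant), 1)
    dcg = sum(h / np.log2(pos + 2) for pos, h in enumerate(hits))
    idcg = sum(1.0 / np.log2(pos + 2) for pos in range(min(len(relevant), k)))
    ndcg_at_k = dcg / idcg if idcg > 0 else 0.0
    return p_at_k, r_at_k, ndcg_at_k

# Example: items 3 and 9 are the held-out positives for this user.
print(metrics_at_k([3, 7, 9, 1, 5], {3, 9}, k=5))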
In the training phase, we tune the hyper-parameters of our IHGNN model via cross-validation and search them with a standard grid search. We first randomly initialize the parameters of our model from a Gaussian distribution with mean 0 and standard deviation 0.02. To optimize our IHGNN model, we adopt the widely used Adaptive Moment Estimation (Adam) optimizer [15] in a mini-batch manner. The batch size is selected from {128, 256, 512}, the learning rate is searched in {1e-4, 1e-3, 1e-2, 5e-4, 5e-3, 5e-2}, and the regularizer is selected from {1e-5, 1e-4, 1e-3, 1e-2, 5e-2}. Because we find the experimental conclusions are consistent when varying the embedding dimension, unless otherwise specified we report results with d = 200, which achieves relatively good performance.

5.4 Quantitative Results

The experimental results of the baselines and the IHGNN model are shown in Tables 3 and 4, with 60% and 80% of users used for training, respectively. From the results, we can make several observations and draw the following conclusions.
Table 3.
Datasets | Metrics | FM\(_{HIG}\) | NCF | GraphSage | STAR-GCN | HeRec | HetGNN | IHGNN
Kwai | Pre | 0.1901 | 0.2245 | 0.2891 | 0.2901 | 0.3789 | 0.3835 | 0.3931
Kwai | Rec | 0.1773 | 0.2017 | 0.2991 | 0.3011 | 0.3858 | 0.3713 | 0.4011
Kwai | NDCG | 0.2011 | 0.2441 | 0.2997 | 0.3012 | 0.3591 | 0.3812 | 0.3901
Kwai | AUC | 0.6003 | 0.6521 | 0.7001 | 0.6881 | 0.7111 | 0.7311 | 0.7402
Tiktok | Pre | 0.2012 | 0.1991 | 0.3015 | 0.2817 | 0.3601 | 0.3679 | 0.3721
Tiktok | Rec | 0.2101 | 0.2048 | 0.3301 | 0.2918 | 0.3901 | 0.3939 | 0.4005
Tiktok | NDCG | 0.2011 | 0.1811 | 0.3339 | 0.2912 | 0.3802 | 0.3811 | 0.4043
Tiktok | AUC | 0.6218 | 0.5991 | 0.6981 | 0.6725 | 0.7317 | 0.7411 | 0.7512
MLs | Pre | 0.2442 | 0.2015 | 0.3129 | 0.2719 | 0.3598 | 0.3611 | 0.3701
MLs | Rec | 0.1911 | 0.1999 | 0.3195 | 0.2991 | 0.3759 | 0.3753 | 0.3871
MLs | NDCG | 0.2312 | 0.2331 | 0.3194 | 0.2849 | 0.3598 | 0.3522 | 0.3701
MLs | AUC | 0.6001 | 0.6419 | 0.6884 | 0.6512 | 0.7101 | 0.7159 | 0.7233
Table 3. Experimental Results of IHGNN and Baselines in Terms of All Datasets (training_ratio = 0.6, k = 10)
Datasets | Metrics | FM\(_{HIG}\) | NCF    | GraphSage | STAR-GCN | HeRec  | HetGNN | IHGNN
Kwai     | Pre     | 0.1034       | 0.1133 | 0.2242    | 0.2014   | 0.2729 | 0.3299 | 0.3535
         | Rec     | 0.2034       | 0.2111 | 0.2954    | 0.2661   | 0.3219 | 0.3501 | 0.3712
         | NDCG    | 0.3305       | 0.3327 | 0.3401    | 0.3391   | 0.3401 | 0.3631 | 0.3825
         | AUC     | 0.5940       | 0.6881 | 0.7172    | 0.6901   | 0.7209 | 0.7525 | 0.7791
Tiktok   | Pre     | 0.1141       | 0.1211 | 0.2523    | 0.2481   | 0.2781 | 0.3129 | 0.3321
         | Rec     | 0.1121       | 0.2141 | 0.2943    | 0.2809   | 0.3001 | 0.3214 | 0.3505
         | NDCG    | 0.3035       | 0.3112 | 0.3415    | 0.3101   | 0.3505 | 0.3501 | 0.3843
         | AUC     | 0.5155       | 0.6501 | 0.6421    | 0.6911   | 0.7098 | 0.7278 | 0.7419
MLs      | Pre     | 0.1449       | 0.1564 | 0.2921    | 0.2519   | 0.3013 | 0.3129 | 0.3201
         | Rec     | 0.2019       | 0.2102 | 0.2731    | 0.2811   | 0.2939 | 0.3113 | 0.3471
         | NDCG    | 0.2402       | 0.2435 | 0.3015    | 0.2901   | 0.3019 | 0.3412 | 0.3601
         | AUC     | 0.6214       | 0.6555 | 0.6883    | 0.6672   | 0.7121 | 0.7311 | 0.7561

Table 4. Experimental Results of IHGNN and Baselines in Terms of All Datasets (training_ratio = 0.8, k = 10).
First, our proposed IHGNN model reliably beats all baseline models on all three datasets for all four metrics, confirming the usefulness and superiority of the IHGNN for user cold-start recommendation tasks. As expected, the traditional methods consistently yield the worst performance on all four metrics: learning representations of users and items, especially new users, simply by incorporating related content information into factorization machines is inadequate, since it ignores rich and expressive relationship information. Compared with the traditional approaches, the graph-based methods achieve significant performance improvements. These results demonstrate that graph convolutional networks can learn better representations of nodes in graphs, especially when inferring new users' representations, which further improves the quality of representations for user cold-start recommendation. For GraphSAGE and STAR-GCN, our experiments show that their performance declines compared with the heterogeneous graph-based representation learning models HeRec and HetGNN. This might be because GraphSAGE focuses only on homogeneous graphs and ignores the heterogeneity of the data, while STAR-GCN is based on a bipartite user-item interaction graph and cannot take heterogeneous content features into consideration. These comparisons further indicate that considering heterogeneous information, which mainly includes heterogeneous content information and various types of relationships, is vital for generating new users' representations for user cold-start recommendation.
The HeRec model performs worse than the HetGNN, probably because its performance depends heavily on the different kinds of representation information extracted along the predesigned meta-paths in heterogeneous graphs. Moreover, HeRec cannot exploit the influence of different types of nodes on the current node when aggregating the content of neighboring nodes and may not effectively capture high-order structural information for cold-start recommendation tasks.
Compared with the heterogeneous aggregation-based method HetGNN, the IHGNN achieves better performance and has the following advantages for the user cold-start recommendation task:
(1) The advantage in HIN construction. Our model absorbs multimodal attribute data into heterogeneous graph nodes instead of considering only users and micro-videos as nodes, and it further uses the heterogeneous and rich relationships among these multimodal attributes as edges in the heterogeneous graph for cold-start recommendation tasks. In contrast, the HetGNN utilizes only user-item interactions for constructing graphs.
(2) The advantage in generating neighbors. The sampling and grouping module of our model searches and samples relevant heterogeneous neighboring nodes of each node based on the rich relationships among multimodal attributes, which allows our model to generate a more robust and comprehensive representation of each user and each micro-video. To sample more related neighboring nodes of the current node, we design the Random Walk Sampling Strategy (RWSS), a random walk–based sampling strategy with a restart probability p for sampling related heterogeneous neighbors of each node in the complex graph M-HG (a minimal sketch is given at the end of this subsection). Compared with existing sampling strategies, the RWSS does not require any prior knowledge, such as meta-paths, to sample heterogeneous neighboring nodes and is not sensitive to interference from noisy nodes. Furthermore, the RWSS is conducted multiple times to ensure the effectiveness of the sampling, because a single sampling operation may miss some important nodes. In contrast, the HetGNN uses relatively simple sampling operations to obtain neighboring nodes on user-item bipartite graphs; it depends heavily on user-item interactions and ignores the relationships among multimodal attribute data.
(3) The advantage in the feature aggregation module. Our feature aggregation module uses a novel hierarchical attention network, consisting of attribute-aware self-attention and neighbor-aware attention, to simultaneously take into account the importance of multimodal attributes and of different neighboring node types when inferring the representation of each user and each item, including unseen new nodes. The hierarchical design accounts for the heterogeneity of the constructed graphs. In contrast, the HetGNN does not consider the importance of different multimodal attributes for the representation learning of each node.
(4) The advantage in inferring representations of new users. In this work, we treat a new user as a graph node and connect the new user node with the input heterogeneous graph M-HG G used in the training process, based on the sparse attributes of the new user. This operation constitutes the inductive new-user representation learning of our model. In contrast, the HetGNN relies only on sparse attributes to generate embeddings of new users, ignoring hidden relationships among users, items, and their attributes.
In conclusion, the experimental results demonstrate that the proposed IHGNN model can deliver better user cold-start recommendation performance.
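To make the RWSS of advantage (2) concrete, the following is a minimal, self-contained sketch of a random walk with restart on an adjacency-list M-HG, repeated several times with the visited neighbors grouped by node type. The function and parameter names (rwss, top_per_type, and so on) are illustrative assumptions rather than the authors' implementation:

```python
import random
from collections import Counter, defaultdict

def rwss(adj, node_types, start, restart_p=0.5, walk_len=20,
         num_walks=10, top_per_type=5):
    """Random walk with restart from `start`, repeated `num_walks` times;
    the most frequently visited nodes are kept, grouped by node type.
    `adj` maps a node id to its neighbor list in the M-HG; `node_types`
    maps a node id to its type (user/item/visual/textual/acoustic/...)."""
    visits = Counter()
    for _ in range(num_walks):            # multiple sampling operations
        cur = start
        for _ in range(walk_len):         # depth of the sampling operation
            if random.random() < restart_p or not adj.get(cur):
                cur = start               # restart at the root node
            else:
                cur = random.choice(adj[cur])
            if cur != start:
                visits[cur] += 1
    groups = defaultdict(list)
    for node, _ in visits.most_common():  # most frequent neighbors first
        t = node_types[node]
        if len(groups[t]) < top_per_type: # fixed-size group per node type
            groups[t].append(node)
    return groups

# Toy usage: neighbors of new user 'u1' grouped by node type.
# rwss({'u1': ['v1', 'v2'], 'v1': ['u1', 't1'], 'v2': ['u1'], 't1': ['v1']},
#      {'u1': 'user', 'v1': 'item', 'v2': 'item', 't1': 'text'}, 'u1')
```

Because frequently revisited nodes dominate the visit counts, noisy one-off neighbors are naturally filtered out, which matches the robustness argument above.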

5.5 Analysis of IHGNN Components

Since our proposed IHGNN consists of multiple key components, we demonstrate their effectiveness by comparing the following variants of the IHGNN:
IHGNN\(\lnot m\): A variant of the IHGNN that removes the multiple-sampling operation and samples the related neighbors of each node v only once; the sampling result of each node v is then copied multiple times.
IHGNN\(\lnot a\): A variant of the IHGNN that removes the self-attention component of the intra-type feature-aggregating module and assigns the same importance value to all multimodal attribute neighboring nodes.
IHGNN\(\lnot n\): A variant of the IHGNN that removes the attention module in the inter-type feature-aggregating component and assigns the same importance value to all neighboring node groups.
The ablation study results in terms of Precision@10, Recall@10, NDCG@10, and AUC on the three datasets are reported in Tables 5 and 6, with 60% and 80% of users used for training, respectively. From the results, we can conclude the following:
Datasets | Metrics | IHGNN\(\lnot m\) | IHGNN\(\lnot a\) | IHGNN\(\lnot n\) | IHGNN
Kwai     | Pre     | 0.3901           | 0.3889           | 0.3511           | 0.3931
         | Rec     | 0.3711           | 0.3710           | 0.3916           | 0.4011
         | NDCG    | 0.3311           | 0.3761           | 0.3811           | 0.3901
         | AUC     | 0.7011           | 0.7001           | 0.7219           | 0.7402
Tiktok   | Pre     | 0.3112           | 0.3412           | 0.3331           | 0.3721
         | Rec     | 0.3911           | 0.3914           | 0.3818           | 0.4005
         | NDCG    | 0.3499           | 0.3812           | 0.3881           | 0.4043
         | AUC     | 0.7391           | 0.7319           | 0.7411           | 0.7512
MLs      | Pre     | 0.3599           | 0.3429           | 0.3311           | 0.3701
         | Rec     | 0.3417           | 0.3312           | 0.3711           | 0.3871
         | NDCG    | 0.3621           | 0.3519           | 0.3599           | 0.3701
         | AUC     | 0.7158           | 0.7119           | 0.6911           | 0.7233

Table 5. Experimental Results of IHGNN and Its Key Components for All Datasets (training_ratio = 0.6, k = 10).
Datasets | Metrics | IHGNN\(\lnot m\) | IHGNN\(\lnot a\) | IHGNN\(\lnot n\) | IHGNN
Kwai     | Pre     | 0.3312           | 0.3381           | 0.3227           | 0.3535
         | Rec     | 0.3345           | 0.3501           | 0.3616           | 0.3712
         | NDCG    | 0.3443           | 0.3502           | 0.3719           | 0.3825
         | AUC     | 0.7601           | 0.7402           | 0.7581           | 0.7791
Tiktok   | Pre     | 0.3033           | 0.3112           | 0.3129           | 0.3321
         | Rec     | 0.3101           | 0.3016           | 0.3121           | 0.3505
         | NDCG    | 0.3704           | 0.3652           | 0.3513           | 0.3843
         | AUC     | 0.7302           | 0.7319           | 0.7359           | 0.7419
MLs      | Pre     | 0.3301           | 0.3051           | 0.3121           | 0.3201
         | Rec     | 0.3099           | 0.3298           | 0.3132           | 0.3471
         | NDCG    | 0.3201           | 0.3019           | 0.3402           | 0.3601
         | AUC     | 0.7339           | 0.7101           | 0.7219           | 0.7561

Table 6. Experimental Results of IHGNN and Its Key Components in Terms of All Datasets (training_ratio = 0.8, k = 10).
The IHGNN achieves better performance than the IHGNN\(\lnot m\) on the three datasets in nearly all cases, which demonstrates that the multiple-sampling operation captures important neighboring nodes more precisely and effectively.
The IHGNN outperforms the IHGNN\(\lnot a\), which shows that the intra-type self-attention component better estimates the importance of nodes of the same type (such as users, items, and visual, textual, and acoustic content).
The results of the IHGNN are superior to those of the IHGNN\(\lnot n\), which indicates that the inter-type attention component effectively estimates the impact of the various types of neighboring node groups (e.g., users, items, attributes) when computing the final node embeddings. A minimal sketch of this two-level attention aggregation follows.
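As a concrete illustration, here is a minimal PyTorch sketch of such a two-level aggregation: intra-type self-attention pools same-type neighbor features into one group embedding per type, and inter-type attention then weights the group embeddings into the final node embedding. The mean pooling and single-layer scoring are simplifying assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class HierarchicalAggregator(nn.Module):
    """Two-level aggregation: intra-type self-attention pools same-type
    neighbor features into a group embedding; inter-type attention then
    weights the per-type group embeddings into the final node embedding."""
    def __init__(self, d=200, num_heads=4):
        super().__init__()
        self.intra = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.score = nn.Linear(d, 1)  # scores each type group

    def forward(self, groups):
        # groups: list of (n_i, d) tensors, one per neighbor type.
        pooled = []
        for g in groups:
            g = g.unsqueeze(0)                   # (1, n_i, d)
            h, _ = self.intra(g, g, g)           # intra-type self-attention
            pooled.append(h.mean(dim=1))         # (1, d) group embedding
        H = torch.cat(pooled, dim=0)             # (num_types, d)
        alpha = torch.softmax(self.score(H), 0)  # inter-type attention weights
        return (alpha * H).sum(dim=0)            # final (d,) node embedding

# Toy usage: two neighbor-type groups with 6 and 4 sampled neighbors.
# emb = HierarchicalAggregator(d=200)([torch.randn(6, 200), torch.randn(4, 200)])
```

Scoring each group with a shared linear layer keeps the inter-type weights comparable across nodes with different neighbor-type mixes, which is one simple way to realize the "down-weight noisy neighbors" behavior described above.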

5.6 Hyper-parameter Sensitivity

We present extended experimental results to analyze the influence of the key hyper-parameters of the IHGNN model, including the number of sampling operations, the ranking position K, the depth of the sampling operation, and the aggregated representation dimension d for users and items, on the three datasets.
Impact of the Ranking Position K: From Figures 5 through 7, it can be observed that the IHGNN shows consistent performance improvements over the graph-based methods across all position parameters, indicating the need to model heterogeneous information (heterogeneous content information and various types of relationships) as well as the superior graph representation learning capability of our IHGNN framework.
Fig. 5. Experimental results of Top-K item recommendation when K varies from 2 to 20 on the Kwai dataset (training_ratio = 0.8).
Fig. 6. Experimental results of Top-K item recommendation when K varies from 2 to 20 on the Tiktok dataset (training_ratio = 0.8).
Fig. 7. Experimental results of Top-K item recommendation when K varies from 2 to 20 on the MovieLens dataset (training_ratio = 0.8).
Impact of the Number of Sampling Operations: From Figure 8, as the number of sampling operations per graph node varies from 2 to 10, the AUC and NDCG@10 results increase slowly until they reach a basically stable value, which illustrates the importance of multiple sampling operations for each node. Furthermore, our model is robust: even a large number of sampling operations per node has little impact on overall performance.
Fig. 8. Experimental results of AUC and NDCG@10 of IHGNN for different numbers of sampling in terms of all datasets (training_ratio = 0.8, k = 10).
Impact of the Depth of Sampling Operation: From Figure 9, as the depth of the sampling operation for each node varies from 2 to 8, the AUC and NDCG@10 results slowly increase at first. Nevertheless, as the depth continues to increase, performance deteriorates, possibly because too many noisy neighboring nodes are included.
Fig. 9. Experimental results of AUC and NDCG@10 of IHGNN for different depth of sampling values in terms of all datasets (training_ratio = 0.8, k = 10).
Impact of the Aggregated Embedding Dimension: From Figure 10, when the aggregated embedding dimension d of each graph node varies between 50 and 350, the AUC and NDCG@10 generally increase. Nevertheless, as d increases further, performance slowly decreases, possibly owing to overfitting.
Fig. 10. Experimental results of AUC and NDCG@10 of IHGNN for different embedding dimensions in terms of all datasets (training_ratio = 0.8, k = 10).

5.7 Qualitative Results

To intuitively illustrate the effectiveness of the IHGNN in inferring new users' representations from highly complex and rich relationships, such as the relational information among the related multimodal attributes of users and items, we visualize a new user together with the user's related attributes, the sampling results from the M-HG, and the attention values of some related micro-videos, as shown in Figure 11. For each new user, based on the user's attributes, we use the sampling operation to sample heterogeneous neighbors in the constructed heterogeneous graph, which can be regarded as the user's enriched information. In Figure 11, the list of mobile apps of the new user includes apps related to basketball and Amazon, which indicates the new user's interests. Furthermore, this information can be utilized to sample related micro-videos, including basketball-related and Amazon-related ones. After aggregation by the hierarchical attention aggregation network, the attention values of the different micro-videos are learned to generate the new user's representation, as shown in the attention value box, where the blue horizontal arrow indicates the magnitude of the learned attention values and the vertical arrow indicates the node number in the graph M-HG. From Figure 11, we observe that basketball-related and Amazon-related micro-videos receive larger attention values than other types of videos. Therefore, the IHGNN model can effectively infer new users' representations from relevant data, which helps improve the performance of the user cold-start recommendation task.
Fig. 11. Visualization of the sampling operation and the attention values of some micro-videos learned by the IHGNN model. In the attention value box, the blue horizontal arrow shows the size of the learned attention values of nodes, and the vertical arrow shows the node number in the graph M-HG.

6 Conclusions

In this work, we are committed to solving the user cold-start recommendation problem. We argue that most existing GNN-based cold-start recommendation methods learn models based only on homogeneous graphs and ignore the rich and various (heterogeneous) relationships among different kinds of heterogeneous information in the user cold-start recommendation scenario. We propose a novel Inductive Heterogeneous Graph Neural Network (IHGNN) model, which takes advantage of rich and heterogeneous relational information to alleviate the sparsity of user attributes. Our model converts new users, items, and the associated multimodal information into a Modality-aware Heterogeneous Graph (M-HG), which preserves the rich and heterogeneous relationship information among them. In addition, a well-designed multiple hierarchical attention-aggregation model, consisting of intra- and inter-type attention-aggregating modules, is proposed to focus on useful connected neighbors and neglect meaningless and noisy ones when learning more expressive representations. We evaluate our IHGNN method on three real datasets; the experimental results on all four metrics show that the proposed IHGNN model outperforms existing baselines on user cold-start recommendation tasks. In the future, we will explore expanding the existing heterogeneous graph with knowledge graphs in GNN models.

References

[1] Gediminas Adomavicius, Jesse C. Bockstedt, Shawn P. Curley, and Jingjing Zhang. 2021. Effects of personalized and aggregate top-N recommendation lists on user preference ratings. ACM Transactions on Information Systems 39, 2 (2021), 13:1–13:38.
[2] Mohammad Aliannejadi and Fabio Crestani. 2018. Personalized context-aware point of interest recommendation. ACM Transactions on Information Systems 36, 4 (2018), 45:1–45:28.
[3] Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, and W. Bruce Croft. 2021. Context-aware target apps selection and recommendation for enhancing personal mobile assistants. ACM Transactions on Information Systems 39, 3 (2021), 29:1–29:30.
[4] Desheng Cai, Shengsheng Qian, Quan Fang, and Changsheng Xu. 2021. Heterogeneous hierarchical feature aggregation network for personalized micro-video recommendation. IEEE Transactions on Multimedia (2021), 1–1.
[5] Hongxu Chen, Hongzhi Yin, Tong Chen, Weiqing Wang, Xue Li, and Xia Hu. 2022. Social boosted recommendation with folded bipartite network embedding. IEEE Transactions on Knowledge and Data Engineering 34, 2 (2022), 914–926.
[6] Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao Zheng, and Yong Yu. 2012. SVDFeature: A toolkit for feature-based collaborative filtering. Journal of Machine Learning Research 13 (2012), 3619–3622.
[7] Wanyu Chen, Fei Cai, Honghui Chen, and Maarten de Rijke. 2019. Joint neural collaborative filtering for recommender systems. ACM Transactions on Information Systems 37, 4 (2019), 39:1–39:30.
[8] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (RecSys’16). Boston, MA, September 15, 2016, ACM, 7–10.
[9] William L. Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, 4–9 December 2017, Long Beach, CA. Curran Associates, Inc., 1024–1034.
[10] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web (WWW’17). Perth, Australia, April 3–7, 2017, ACM, 173–182.
[11] Binbin Hu, Chuan Shi, Wayne Xin Zhao, and Philip S. Yu. 2018. Leveraging meta-path based context for top-N recommendation with a neural co-attention model. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’18). London, UK, August 19–23, 2018, ACM, 1531–1540.
[12] Jun Hu, Shengsheng Qian, Quan Fang, Youze Wang, Quan Zhao, Huaiwen Zhang, and Changsheng Xu. 2021. Efficient graph deep learning in tensorflow with tf_geometric. In ACM Multimedia Conference (MM’21), Virtual Event, China, October 20–24, 2021, Heng Tao Shen, Yueting Zhuang, John R. Smith, Yang Yang, Pablo Cesar, Florian Metze, and Balakrishnan Prabhakaran (Eds.). ACM, 3775–3778.
[13] Jun Hu, Shengsheng Qian, Quan Fang, and Changsheng Xu. 2019. Hierarchical graph semantic pooling network for multi-modal community question answer matching. In Proceedings of the 27th ACM International Conference on Multimedia (MM’19), Nice, France, October 21–25, 2019, Laurent Amsaleg, Benoit Huet, Martha A. Larson, Guillaume Gravier, Hayley Hung, Chong-Wah Ngo, and Wei Tsang Ooi (Eds.). ACM, 1157–1165.
[14] Shuyi Ji, Yifan Feng, Rongrong Ji, Xibin Zhao, Wanwan Tang, and Yue Gao. 2020. Dual channel hypergraph collaborative filtering. In 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’20), Virtual Event, CA, August 23–27, 2020, ACM, 2020–2029.
[15] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR’15). San Diego, CA, May 7–9, 2015, 1–13.
[16] Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations (ICLR’17). Toulon, France, April 24–26, 2017, OpenReview.net, 1–14.
[17] Yehuda Koren, Robert M. Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[18] Pigi Kouki, Shobeir Fakhraei, James R. Foulds, Magdalini Eirinaki, and Lise Getoor. 2015. HyPER: A flexible and extensible probabilistic framework for hybrid recommender systems. In Proceedings of the 9th ACM Conference on Recommender Systems (RecSys’15). Vienna, Austria, September 16–20, 2015, ACM, 99–106.
[19] Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML’14). Beijing, China, June 21–26, 2014 (JMLR Workshop and Conference Proceedings), Vol. 32, 1188–1196.
[20] Xiaopeng Li and James She. 2017. Collaborative variational autoencoder for recommender systems. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Halifax, NS, Canada, August 13–17, 2017, ACM, 305–314.
[21] Zhao Li, Xin Shen, Yuhang Jiao, Xuming Pan, Pengcheng Zou, Xianling Meng, Chengwei Yao, and Jiajun Bu. 2020. Hierarchical bipartite graph neural networks: Towards large-scale E-commerce applications. In 36th IEEE International Conference on Data Engineering (ICDE’20). Dallas, TX, April 20–24, 2020, IEEE, 1677–1688.
[22] Song Liu, Haoqi Fan, Shengsheng Qian, Yiru Chen, Wenkui Ding, and Zhongyuan Wang. 2021. HiT: Hierarchical transformer with momentum contrast for video-text retrieval. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV’21). Montreal, QC, Canada, October 10–17, 2021, IEEE Computer Society, 11895–11905.
[23] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15). Boston, MA, June 7–12, 2015, 3431–3440.
[24] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, December 5–8, 2013, Lake Tahoe, Nevada. Curran Associates, Inc., 3111–3119.
[25] Federico Monti, Michael M. Bronstein, and Xavier Bresson. 2017. Geometric matrix completion with recurrent multi-graph neural networks. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, CA, 3697–3707.
[26] Fedelucio Narducci, Pierpaolo Basile, Cataldo Musto, Pasquale Lops, Annalina Caputo, Marco de Gemmis, Leo Iaquinta, and Giovanni Semeraro. 2016. Concept-based item representations for a cross-lingual content-based recommendation process. Information Sciences 374 (2016), 15–31.
[27] Shengsheng Qian, Dizhan Xue, Huaiwen Zhang, Quan Fang, and Changsheng Xu. 2021. Dual adversarial graph neural networks for multi-label cross-modal retrieval. In 35th AAAI Conference on Artificial Intelligence (AAAI’21), 33rd Conference on Innovative Applications of Artificial Intelligence (IAAI’21), 11th Symposium on Educational Advances in Artificial Intelligence (EAAI’21). Virtual Event, February 2–9, 2021, 2440–2448.
[28] Shengsheng Qian, Tianzhu Zhang, Changsheng Xu, and Jie Shao. 2016. Multi-modal event topic model for social event analysis. IEEE Transactions on Multimedia 18, 2 (2016), 233–246.
[29] Lei Sang, Min Xu, Shengsheng Qian, Matt Martin, Peter Li, and Xindong Wu. 2021. Context-dependent propagating-based video recommendation in multimodal heterogeneous information networks. IEEE Transactions on Multimedia 23 (2021), 2019–2032.
[30] Chuan Shi, Binbin Hu, Wayne Xin Zhao, and Philip S. Yu. 2019. Heterogeneous information network embedding for recommendation. IEEE Transactions on Knowledge and Data Engineering 31, 2 (2019), 357–370.
[31] Chuan Shi, Binbin Hu, Wayne Xin Zhao, and Philip S. Yu. 2019. Heterogeneous information network embedding for recommendation. IEEE Transactions on Knowledge and Data Engineering 31, 2 (2019), 357–370.
[32] Chuan Shi, Jian Liu, Fuzhen Zhuang, Philip S. Yu, and Bin Wu. 2016. Integrating heterogeneous information via flexible regularization framework for recommendation. Knowledge and Information Systems 49, 3 (2016), 835–859.
[33] Chuan Shi, Zhiqiang Zhang, Yugang Ji, Weipeng Wang, Philip S. Yu, and Zhiping Shi. 2019. SemRec: A personalized semantic recommendation method based on weighted heterogeneous information networks. World Wide Web 22, 1 (2019), 153–184.
[34] Florian Strub, Romaric Gaudel, and Jérémie Mary. 2016. Hybrid recommender system based on autoencoders. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (DLRS@RecSys’16). Boston, MA, September 15, 2016, 11–16.
[35] Jianing Sun, Yingxue Zhang, Chen Ma, Mark Coates, Huifeng Guo, Ruiming Tang, and Xiuqiang He. 2019. Multi-graph convolution collaborative filtering. In 2019 IEEE International Conference on Data Mining (ICDM’19). Beijing, China, November 8–11, 2019, IEEE, 1306–1311.
[36] Rianne van den Berg, Thomas N. Kipf, and Max Welling. 2017. Graph convolutional matrix completion. CoRR abs/1706.02263 (2017).
[37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, CA. Curran Associates, Inc., 5998–6008.
[38] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In 6th International Conference on Learning Representations (ICLR’18). Vancouver, BC, Canada, April 30–May 3, 2018, OpenReview.net, 1–12.
[39] Xiang Wang, Hongye Jin, An Zhang, Xiangnan He, Tong Xu, and Tat-Seng Chua. 2020. Disentangled graph collaborative filtering. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’20). Virtual Event, China, July 25–30, 2020, ACM, 1001–1010.
[40] Xiao Wang, Ruijia Wang, Chuan Shi, Guojie Song, and Qingyong Li. 2020. Multi-component graph convolutional collaborative filtering. In 34th AAAI Conference on Artificial Intelligence (AAAI’20), 32nd Innovative Applications of Artificial Intelligence Conference (IAAI’20), 10th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI’20). New York, NY, February 7–12, 2020, AAAI Press, 6267–6274.
[41] Jian Wei, Jianhua He, Kai Chen, Yi Zhou, and Zuoyin Tang. 2017. Collaborative filtering and deep learning based recommendation system for cold start items. Expert Systems with Applications 69 (2017), 29–39.
[42] Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, and Tat-Seng Chua. 2019. MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video. In Proceedings of the 27th ACM International Conference on Multimedia (MM’19). Nice, France, October 21–25, 2019, ACM, 1437–1445.
[43] Libing Wu, Cong Quan, Chenliang Li, Qian Wang, Bolong Zheng, and Xiangyang Luo. 2019. A context-aware user-item representation learning for item recommendation. ACM Transactions on Information Systems 37, 2 (2019), 22:1–22:29.
[44] Hongzhi Yin, Bin Cui, Xiaofang Zhou, Weiqing Wang, Zi Huang, and Shazia W. Sadiq. 2016. Joint modeling of user check-in behaviors for real-time point-of-interest recommendation. ACM Transactions on Information Systems 35, 2 (2016), 11:1–11:44.
[45] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’18). London, UK, August 19–23, 2018, ACM, 974–983.
[46] Chuxu Zhang, Dongjin Song, Chao Huang, Ananthram Swami, and Nitesh V. Chawla. 2019. Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’19). Anchorage, AK, August 4–8, 2019, ACM, 793–803.
[47] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander J. Smola. 2020. ResNeSt: Split-attention networks. CoRR abs/2004.08955 (2020).
[48] Jiani Zhang, Xingjian Shi, Shenglin Zhao, and Irwin King. 2019. STAR-GCN: Stacked and reconstructed graph convolutional networks for recommender systems. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI’19). Macao, China, August 10–16, 2019, ijcai.org, 4264–4270.
[49] Jing Zheng, Jian Liu, Chuan Shi, Fuzhen Zhuang, Jingzhi Li, and Bin Wu. 2017. Recommendation in heterogeneous information network via dual similarity regularization. International Journal of Data Science and Analytics 3, 1 (2017), 35–48.

