1 Introduction

As the consumer economy enters an era of “information overload” [1], a large number of personalized service platforms, including recommendation systems, have emerged to satisfy users’ increasingly individualized demands. The user profile, a comprehensive understanding of a user, is the basis of recommendation systems [19] and precision marketing [2, 3]. It is therefore essential to provide an accurate and effective method for obtaining user profiles.

Recently, user profiles have been widely studied with label propagation methods that infer user interests [4,5,6], user authority [21] and social attributes [20]. To obtain more accurate and richer user profiles, many studies apply multiple labels. Some studies assumed that labels were independent [5], ignoring the associations among them. This assumption is inconsistent with reality and cannot mine hidden label features well. To overcome this limitation, Glenn et al. [1] considered the associations among labels when building user profiles and obtained better performance. The associations in these works are explicit, meaning that there are clear and definable connections among labels. For example, “Photography” and “Camera” are directly related, so their association is called explicit. A large number of methods based on explicit associations have since been proposed and have achieved good performance [2, 22]. However, existing methods rarely consider implicit associations among labels, where the connections are internal rather than direct.

In many real-world applications, the associations among labels are complex [25]. We observe that, in addition to explicit associations, implicit associations also exist among labels for various reasons, such as uncertainty [23] or privacy issues [24]. For example, “Travel” and “Health” are not explicitly related. However, they have many inherent connections, because people who like travelling tend to pay more attention to their health. Utilizing the implicit associations among labels makes the user profile more accurate and comprehensive.

To take advantage of this insight, we propose a multi-label propagation method with implicit label associations (MLP-IA) to build user profiles. We first design a probability matrix to record the implicit associations and then combine this matrix with a multi-label propagation method to obtain a more accurate user profile. Finally, we prove that the method converges and show that it is faster than traditional label propagation algorithms. The main contributions of this study are summarized as follows:

  • Insight. We present a novel insight about implicit associations among labels. On social platforms, users’ social and living habits give rise to implicit associations among labels, and mining these associations is useful for constructing user profiles.

  • Method. We propose a multi-label propagation method with implicit label associations to build user profiles. We first design a probability matrix to record the implicit associations and then combine the multi-label propagation method with this matrix to obtain a more accurate user profile. We also prove that the method converges and show that it is faster than the traditional label propagation algorithm.

  • Evaluation. We conduct experiments on six real Weibo data sets of different sizes. The comparative experiments evaluate the accuracy and effectiveness of the proposed method. The results show that our method accelerates convergence and performs better than the previous methods.

The rest of the paper is organized as follows. Sect. 2 reviews related work. Sect. 3 explores our insight about implicit association labels, and Sect. 4 describes the proposed method and its efficiency. Sect. 5 describes the experiments and results. Finally, Sect. 6 presents conclusions and future work.

2 Related Works

Existing research on user profiles can be divided into two categories. One infers a user’s unknown attributes from the user’s own data by text-mining methods, and the other propagates labels over the social-network structure.

2.1 Text-Mining Methods

Many text-mining methods have been used to extract user profiles. A user’s own data generally contains rich semantic information, so the user profile problem can be regarded as a text analysis problem [7, 8]. For interest profiles, most researchers apply a topic model (LDA) to extract keywords from blogs and then use the TF-IDF algorithm to select features [9,10,11]. However, text mining often has high complexity, and the extracted profiles are unstable because of the richness of semantics.

2.2 Social-Network Structure

These methods label unknown users by propagating the labels of known users, and multi-label algorithms are widely applied. Zhang et al. used a multi-label propagation algorithm to mine user interests and discovered potential interests of users through social relationships [6]. Xie et al. proposed the speaker-listener mechanism to update labels [12]. Dong et al. inferred gender and age using a social network, which is feasible only when the set of attribute values is extremely restricted [13]. To address this, Chakrabarti et al. [14] proposed a method called EDGEEXPLAIN, and Besel et al. built interest profiles by propagating activation functions and showed that the method was more suitable than state-of-the-art methods [15]. Ma et al. introduced label propagation to improve the accuracy of a semi-supervised learning algorithm [16].

Some scholars found that there were links between labels. Recently, explicit associations among labels have been taken into consideration: Glenn et al. [1] introduced explicit association labels, and the results showed that their method performed well. To the best of our knowledge, there is little research on implicit associations among labels.

3 Priori Knowledge with Implicit Association Labels

We observe that, on many social platforms, hot-spot events or other special reasons create associations among implicit association labels. For example, in Fig. 1 each node represents a Weibo user and a directed edge indicates a social relationship between users. As highlighted by the orange labels, the majority of Weibo users who like Entertainment also like Health.

Fig. 1. An example of interest label propagation in Weibo. Note: nodes represent Weibo users, and the labels attached to a node are the topics the user is interested in. Green nodes are users with high influence, such as V-plus users, and orange nodes are ordinary users. A directed edge indicates a social relationship between users. Some explicit labels are highlighted in blue. (Color figure online)

We analyze the statistical characteristics of user interest labels in Weibo by correlation analysis [26]; a higher value indicates a stronger association among implicit association labels. Fig. 2 shows the top five label pairs, where the value is the correlation score of the two interest labels.

Fig. 2. Results of label association analysis.

The statistical results are explicable. For example, Fig. 2 shows an association between Health and Tourism. In the real world, users who like a healthy lifestyle tend to pay more attention to tourism information, since they can enrich their lives through tourism and develop a healthy, relaxing lifestyle.

Based on our observation, implicit associations exist among labels, and these features can be used to build a better user profile. Therefore, we introduce the priori probability among implicit association labels to improve the user profile model. The details are given in the next section.

4 Our Model

This section focuses on the improved multi-label propagation. First, we construct the priori knowledge that introduces the associations among implicit labels. Then the two major matrices of the multi-label propagation algorithm are initialized for propagation. Next, the model is trained via labeled users, and we obtain the unlabeled users’ labels after a finite number of iterations.

The symbols mentioned in the paper are shown in Table 1.

Table 1. Symbols of our paper

4.1 Priori Knowledge of Implicit Association Labels

Sect. 3 presented a new insight about associations among implicit association labels. We analyze the interest labels of users in Weibo and find that there are connections among different interest labels.

Specifically, we define the priori probability matrix P as shown in Eq. 1. The higher the value of \( {\text{p}}_{\text{ij}} \), the higher the probability of propagation between labels \( l_{i} \) and \( l_{j} \).

$$ {\text{p}}_{\text{ij}} = \frac{{\left| {\left\{ {t|t \in I \wedge \left( {l_{i} ,l_{j} } \right)\, \subseteq \,t} \right\}} \right|}}{Z} $$
(1)

where \( Z = \mathop \sum \nolimits_{i = 0}^{m} \mathop \sum \nolimits_{j = 0}^{m} \left| {\left\{ {t|t \in I \wedge \left( {l_{i} ,l_{j} } \right)\, \subseteq \,t} \right\}} \right| \). Previous work has shown that the associations in social networks are complex due to various reasons, such as uncertainty [23] or special events [24]. Therefore, we build the elements of \( {\text{I}} \) from co-occurrence, cultural associations, event associations, custom associations and so on, as shown in Eq. 2.

$$ {\text{I}} = {\text{I}}_{1} \cup {\text{I}}_{2} \cup {\text{I}}_{3} \cup \ldots $$
(2)

where \( {\text{I}}_{\text{i}} \left( {{\text{i}} = 1,2,3, \ldots } \right) \) denote, respectively, the collection of users’ interest label sets and the label sets sampled by cultural associations, event associations and custom associations.
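The construction in Eqs. 1 and 2 can be illustrated with a minimal Python sketch. It assumes that \( {\text{I}} \) is available as a plain list of label sets and that a pair \( (l_i, l_j) \) is counted once for every set that contains both labels, including the case i = j; this is a literal reading of Eq. 1 rather than something the text states. The function name `prior_matrix` and the sample label sets are hypothetical.

```python
def prior_matrix(label_sets, labels):
    """Sketch of Eq. 1: p_ij = (number of sets containing l_i and l_j) / Z."""
    index = {label: k for k, label in enumerate(labels)}
    m = len(labels)
    counts = [[0] * m for _ in range(m)]
    for t in label_sets:
        # Positions of the known labels that occur in this set.
        present = sorted({index[l] for l in t if l in index})
        for i in present:
            for j in present:
                counts[i][j] += 1  # assumption: the diagonal (i == j) is counted too
    z = sum(sum(row) for row in counts)  # normalizer Z of Eq. 1
    if z == 0:
        return [[0.0] * m for _ in range(m)]
    return [[c / z for c in row] for row in counts]


# Hypothetical sample: I = I_1 (user label sets) united with I_2 (event-based sets).
I_1 = [{"Travel", "Health"}, {"Travel", "Health", "Food"}]
I_2 = [{"Food"}]
P = prior_matrix(I_1 + I_2, ["Travel", "Health", "Food"])
```

Because every entry is divided by the same Z, the entries of the resulting matrix sum to 1 and the matrix is symmetric under this counting rule.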

4.2 Introduction of the Priori Knowledge

We observe that a label spreads according to both other users and other labels, that is \( {\text{F}} = {\text{U}} \cdot {\text{L}} \cdot {\text{P}} \). The labels are propagated among users, and each iteration is given by Eq. 3, where \( \uplambda \) is a hyperparameter that controls the influence of initialization.

$$ {\text{F}}^{{\left( {{\text{t}} + 1} \right)}} = \lambda T \cdot F^{\left( t \right)} \cdot P + \left( {1 - \lambda } \right)F^{\left( 0 \right)} $$
(3)

In each iteration, a user’s labels are updated from the neighbor nodes’ labels and the implicitly associated labels. Note that after each iteration, the rows of F that belong to labeled users are reset to \( {\text{F}}_{\text{a}} \) before the next propagation step, which makes our model a semi-supervised learning method.

The loss function in the model uses the squared distance, as is shown in Eq. 4.

$$ {\text{loss}} = \left| {{\text{F}}^{\left( {t + 1} \right)} - {\text{F}}^{\left( t \right)} } \right|^{2} + \frac{1}{2}\left\| {1 - \zeta } \right\|^{2} $$
(4)

In real-world networks, it is difficult to construct the complete network structure because of privacy and security constraints. Therefore, we take the integrity of the social network into account by introducing a hyperparameter \( \zeta \) into the model.

We define \( \zeta = \frac{\text{number of relationships in the dataset}}{\text{number of relationships in the real world}} \), which measures the completeness of the social network (a low value means that the observed network is sparse). As more relationships are added when constructing the graph, \( \zeta \) tends to 1 and the loss becomes smaller accordingly, which yields a better user profile.
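As a small illustration of Eq. 4, the sketch below computes the loss for two successive label matrices. It assumes that the first term is the sum of squared entry-wise differences and that an estimate of the real-world number of relationships is available, which in practice it usually is not. The name `propagation_loss` and its parameters are hypothetical.

```python
def propagation_loss(f_next, f_prev, known_edges, total_edges):
    """Sketch of Eq. 4: squared change between iterations plus 0.5 * (1 - zeta)^2."""
    if total_edges <= 0:
        raise ValueError("total_edges must be positive")
    # zeta: observed relationships divided by (estimated) real-world relationships.
    zeta = known_edges / total_edges
    change = sum(
        (a - b) ** 2
        for row_next, row_prev in zip(f_next, f_prev)
        for a, b in zip(row_next, row_prev)
    )
    return change + 0.5 * (1.0 - zeta) ** 2


# Hypothetical values: half of the real relationships are observed.
loss = propagation_loss([[0.5, 0.0], [1.0, 0.0]], [[0.0, 0.0], [1.0, 0.0]], 50, 100)
```

With the same label matrices, raising `known_edges` towards `total_edges` drives the second term to zero, which matches the statement that a more complete network gives a smaller loss.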

4.3 Multi-label User Profile Based on Implicit Association Labels

Formally, we are given a label set \( {\text{L}} = \left\{ {{\text{l}}_{1} , \ldots ,{\text{l}}_{m} } \right\} \) and a user set \( {\text{U}} = \left\{ {{\text{u}}_{1} , \ldots , {\text{u}}_{a} , \ldots ,{\text{u}}_{a + b} } \right\} \), which contains a users with labels and b users without labels, together with their label matrix \( {\text{F}} = \left[ {{\text{F}}_{\text{a}} {;}\, {\text{F}}_{\text{b}} } \right] \).

If the nodes form a graph, the multi-label propagation algorithm infers labels from the aggregated labels of each node’s neighbors until the labels of all nodes no longer change [26]. The key to inference is the transition probability between nodes. In our model, two major matrices are constructed: the users’ initial interest vector matrix and the label transfer matrix. The details of the two matrices are as follows.

Initial Interest Vector Matrix.

The initial interest vector matrix contains two parts. One-hot encoding is used to build the labeled users’ initial vectors of interest labels. As shown in Eq. 5, an entry is 1 if the user has the label and 0 otherwise.

$$ {\text{f}}_{\text{ij}} = \begin{cases} 1 & \text{if user } i \text{ has label } j \\ 0 & \text{otherwise} \end{cases} $$
(5)

The unlabeled users’ initial vectors are zero vectors. In this way, we transform the multiple interest labels into multi-label vectors for propagation. Note that after each update iteration of the label propagation algorithm, the labeled users’ rows of the interest label matrix must be reset to their known values so as to obtain more accurate propagation results.
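The initialization described by Eq. 5 can be sketched in a few lines of Python. The sketch assumes that known labels are supplied as a dictionary from user to label collection and that users missing from the dictionary are unlabeled; the function name and the sample users are hypothetical.

```python
def initial_interest_matrix(user_labels, users, labels):
    """Sketch of Eq. 5: f_ij = 1 if user i has label j, else 0."""
    index = {label: j for j, label in enumerate(labels)}
    matrix = []
    for user in users:
        row = [0.0] * len(labels)
        # Unlabeled users are absent from user_labels and keep an all-zero row.
        for label in user_labels.get(user, ()):
            row[index[label]] = 1.0
        matrix.append(row)
    return matrix


# Hypothetical sample: u1 is labeled, u2 is unlabeled.
F0 = initial_interest_matrix({"u1": ["Health", "Travel"]}, ["u1", "u2"],
                             ["Travel", "Health", "Food"])
```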

Probability Transfer Matrix.

The interest labels spread among users. The label transfer matrix T is constructed from social relationships as shown in Eq. 6. Here we introduce two matrices D and W for better convergence, as proved in [17]. The elements of D are computed by Eq. 7 and the elements of W by Eq. 8.

$$ {\text{T}} = {\text{D}}^{{ - \frac{1}{2}}} W{\text{D}}^{{ - \frac{1}{2}}} $$
(6)
$$ {\text{d}}_{\text{ii}} = \sum\nolimits_{j = 1, \ldots ,n} {w_{ij} } $$
(7)
$$ {\text{w}}_{\text{ij}} = \left\{ {\begin{array}{*{20}c} {c_{ij} \times {\text{Sim}}\left( {{\text{i}},{\text{j}}} \right)} & {i \ne j} \\ 0 & {i = j} \\ \end{array} } \right. $$
(8)

where \( Sim\left( {i,j} \right) \) denotes the similarity between user i and user j, defined in Eq. 9. That is, the closer the fans-to-follows ratios of two users are, the higher their similarity.

$$ {\text{Sim}}\left( {{\text{i}},{\text{j}}} \right) = \frac{1}{{\left| {\log \left( {\frac{{\left| {FANS_{i} } \right|}}{{\left| {FOLLOW_{i} } \right|}}} \right) - \log \left( {\frac{{\left| {FANS_{j} } \right|}}{{\left| {FOLLOW_{j} } \right|}}} \right)} \right| + 1}} $$
(9)
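Eqs. 6 to 9 can be put together in a short Python sketch. Several points are assumptions rather than statements from the text: \( c_{ij} \) is treated as a 0/1 indicator of whether users i and j are connected, the logarithm is the natural logarithm, all fan and follow counts are strictly positive, and a user with no weighted edges gets a zero row instead of a division by zero. All function names are hypothetical.

```python
import math


def similarity(fans_i, follow_i, fans_j, follow_j):
    """Eq. 9: 1 / (|log(fans_i / follow_i) - log(fans_j / follow_j)| + 1)."""
    gap = abs(math.log(fans_i / follow_i) - math.log(fans_j / follow_j))
    return 1.0 / (gap + 1.0)


def weight_matrix(connected, fans, follows):
    """Eq. 8: w_ij = c_ij * Sim(i, j) for i != j, and 0 on the diagonal."""
    n = len(connected)
    return [
        [
            connected[i][j] * similarity(fans[i], follows[i], fans[j], follows[j])
            if i != j else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def transfer_matrix(weights):
    """Eqs. 6 and 7: T = D^(-1/2) W D^(-1/2), with d_ii the i-th row sum of W."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    # Assumption: isolated users (degree 0) are scaled by 0 rather than 1/0.
    scale = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    return [[scale[i] * weights[i][j] * scale[j] for j in range(n)] for i in range(n)]


# Hypothetical sample: two connected users with identical fans-to-follows ratios.
W = weight_matrix([[0, 1], [1, 0]], fans=[100, 100], follows=[10, 10])
T = transfer_matrix(W)
```

Two users with the same fans-to-follows ratio get similarity 1, and the similarity decreases as the log ratios move apart.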

Next, we train the model via labeled users with the loss function defined in Eq. 4. The iteration of Eq. 3 stops once the loss is less than a preset threshold. The method is proved to be convergent in Sect. 4.4. The specific algorithm flow is shown in Algorithm 1.

Algorithm 1.
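The overall flow can be approximated by the following Python sketch, which is not the authors' Algorithm 1 but a reading of Eq. 3 together with the clamping of labeled users described above. It assumes that T, P and \( F^{(0)} \) are given as nested lists, and it stops on the squared change between iterations only, since the \( \zeta \) term of Eq. 4 is constant during propagation. The names `propagate`, `labeled_rows`, `tol` and `max_iter` are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def propagate(T, P, F0, labeled_rows, lam=0.5, tol=1e-6, max_iter=1000):
    """Iterate F <- lam * T F P + (1 - lam) * F0 (Eq. 3), clamping labeled rows."""
    F = [row[:] for row in F0]
    for iteration in range(1, max_iter + 1):
        TFP = matmul(matmul(T, F), P)
        F_next = [
            [lam * v + (1.0 - lam) * v0 for v, v0 in zip(row, row0)]
            for row, row0 in zip(TFP, F0)
        ]
        # Reset labeled users to their known labels (the correction by F_a).
        for i in labeled_rows:
            F_next[i] = F0[i][:]
        change = sum((a - b) ** 2 for r1, r2 in zip(F_next, F) for a, b in zip(r1, r2))
        F = F_next
        if change < tol:
            return F, iteration
    return F, max_iter


# Hypothetical sample: user 0 is labeled, users 1 and 2 are not.
T = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
P = [[0.5, 0.5], [0.5, 0.5]]
F0 = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
F, iterations = propagate(T, P, F0, labeled_rows=[0])
```

In this sample, the non-identity P lets the second label receive mass for the unlabeled users even though no user carries it initially, which is the effect the priori matrix is meant to have.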

4.4 Analysis of Algorithms

Convergence Analysis.

The convergence of our method is shown as follows. Let the users’ label matrix be F. According to the labeled and unlabeled users, T can be divided into sub-matrices as shown in Eq. 10, where the subscript “a” indicates users whose labels are known and the subscript “b” indicates users whose labels are unknown.

$$ {\text{T}} = \left[ {\begin{array}{*{20}c} {T_{aa} } & {T_{ab} } \\ {T_{ba} } & {T_{bb} } \\ \end{array} } \right] $$
(10)

From the section above, the core formula of the label propagation algorithm proposed in this paper is Eq. 3. The matrix P represents the co-occurrence relationship of labels, and \( 0 \le {\text{p}}_{\text{ij}} \le 1 \). \( {\text{F}}_{\text{a}} \) holds the interest labels of the source users, which are fixed. Therefore, our method is simplified as \( {\text{F}}_{\text{b}} \leftarrow T_{bb} {\text{F}}_{\text{b}} P + T_{ba} {\text{F}}_{\text{a}} P \), which leads to Eq. 11.

$$ {\text{F}}_{\text{b}} = \lim_{{{\text{n}} \to \infty }} \left( {{\text{T}}_{\text{bb}} } \right)^{\text{n}} F_{b}^{\left( 0 \right)} P^{n} + \left( {\sum\nolimits_{i = 1}^{n} {\left( {T_{bb} } \right)^{{\left( {i - 1} \right)}} } } \right)T_{ba} F_{a} P^{n} $$
(11)

where \( F_{b}^{\left( 0 \right)} \) is the initial value of \( F_{b} \). Because the matrix T is row-normalized and \( T_{bb} \) is a submatrix of T, Eq. 12 holds. Therefore, we obtain Eq. 13.

$$ \exists \gamma < 1,\sum\limits_{j = 1}^{u} {(T_{bb} )_{ij} } \le \gamma ,\forall i = 1, \ldots ,u $$
(12)
$$ \begin{aligned} \sum\nolimits_{j} {(T_{bb} )^{n}_{ij} } & = \sum\limits_{j} {\sum\limits_{k} {(T_{bb} )^{(n - 1)}_{ik} (T_{bb} )_{kj} } } = \sum\limits_{k} {(T_{bb} )^{(n - 1)}_{ik} \sum\limits_{j} {(T_{bb} )_{kj} } } \\ & \le \sum\limits_{k} {(T_{bb} )^{(n - 1)}_{ik} \gamma } \le \gamma^{n} \\ \end{aligned} $$
(13)
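The bound in Eqs. 12 and 13 can be checked numerically on a small example. The sketch below assumes a non-negative matrix whose row sums are all at most some \( \gamma < 1 \); the sample matrix is made up for illustration and is not derived from any data set in this paper.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def max_row_sum(matrix):
    return max(sum(row) for row in matrix)


def row_sum_bounds(T_bb, steps):
    """Return (largest row sum of T_bb^n, gamma^n) for n = 1..steps."""
    gamma = max_row_sum(T_bb)  # smallest gamma that satisfies Eq. 12
    power = [row[:] for row in T_bb]
    pairs = []
    for n in range(1, steps + 1):
        pairs.append((max_row_sum(power), gamma ** n))
        power = matmul(power, T_bb)
    return pairs


# Hypothetical T_bb with non-negative entries and row sums 0.5 and 0.7.
for actual, bound in row_sum_bounds([[0.2, 0.3], [0.1, 0.6]], steps=5):
    print(f"{actual:.6f} <= {bound:.6f}")
```

Each printed pair shows the largest row sum of \( (T_{bb})^{n} \) next to \( \gamma^{n} \); both shrink geometrically, which is the property the convergence argument relies on.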

P is the priori knowledge with \( 0 \le {\text{p}}_{\text{ij}} \le 1 \), which can accelerate convergence. Eq. 13 indicates that the sum of each row of \( \left( {{\text{T}}_{\text{bb}} } \right)^{\text{n}} \) converges to 0, from which we conclude that \( \left( {{\text{T}}_{\text{bb}} } \right)^{\text{n}} F_{b}^{\left( 0 \right)} P^{n} \to 0 \). Thus the initial value \( F_{b}^{\left( 0 \right)} \) is inconsequential, and \( {\text{F}}_{\text{b}} = \left( {{\text{I}} - {\text{T}}_{\text{bb}} } \right)^{ - 1} T_{ba} F_{a} \) is a fixed point (here \( {\text{I}} \) denotes the identity matrix).

Time Complexity Analysis.

In the initialization of the label propagation algorithm, we need to establish an initial label vector for each Weibo user, and the complexity of this process is O(n). In the propagation of interest labels, because our method converges, the number of iterations is bounded, and the time complexity of each iteration is O(m). Therefore, the entire algorithm is nearly linear in time complexity.

5 Experiments

5.1 Dataset

Weibo, which is similar to Twitter, is the largest social network platform in China. To demonstrate the generality and effectiveness of our method, we evaluate it on Weibo data sets of different scales.

First, we randomly collect six different sets of users and their social data, such as followers, fans and blogs. The scales of the data sets are listed in Table 2. The data sets were collected at different times. Due to Weibo’s limits, we obtain only part of each user’s followed users, so the social relationships are sparse.

Table 2. Dataset of our paper

Next, according to the characteristics of Weibo, we manually labeled users with interest labels based on their blogs and social relationships. A user is selected as a labeled user if the user is marked with a “V”, which means that the user’s identity has been verified by Sina. According to the analysis of Jing et al. [18], these users are critical in the propagation.

5.2 Comparisons and Evaluation Setting

To evaluate the performance of our method (MLP-IA), we compare it with other methods. Table 3 lists the compared baselines. We first compare with traditional Multi-Label Propagation (MLP) to evaluate the effectiveness of the priori knowledge. Then we select Multi-Label Propagation Based on Explicit Association Labels (MLP-EA), following [1], to evaluate whether implicit association labels mine the relationships among labels better than explicit association labels. In the third baseline, Multi-Label Propagation Based on Explicit and Implicit Association Labels (MLP-EIA), we introduce associations among both explicit and implicit association labels to fully explore the relationships among labels. Finally, to explore the impact of social relationship integrity on the results, we conduct additional experiments in Sect. 5.4.

Table 3. Baselines of our paper

In the experiments, we analyze the precision ratio and the recall ratio of each method, which represent the accuracy and the comprehensiveness of the user profile, respectively. The F1-Measure is the harmonic mean of the precision ratio and the recall ratio and summarizes the overall performance of a method.
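For concreteness, the three measures can be computed as in the Python sketch below. The text does not state how scores are aggregated over users, so the sketch assumes micro-averaging, that is, true positives, false positives and false negatives are pooled over all users before the ratios are taken. The function name `micro_prf` and the sample label sets are hypothetical.

```python
def micro_prf(true_sets, predicted_sets):
    """Micro-averaged precision, recall and F1 over per-user label sets."""
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, predicted_sets):
        tp += len(truth & pred)   # labels predicted and correct
        fp += len(pred - truth)   # labels predicted but wrong
        fn += len(truth - pred)   # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical sample with two users.
precision, recall, f1 = micro_prf(
    [{"Travel", "Health"}, {"Food"}],
    [{"Travel"}, {"Food", "Sports"}],
)
```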

5.3 Results and Analysis

The experimental results are shown in Fig. 3. MLP has the highest precision ratio, and MLP, MLP-IA and MLP-EIA have stable results. However, as the size of the data set increases, the precision ratio of MLP-EA decreases continuously. In recall ratio, MLP-IA and MLP-EIA perform better than the others and MLP has the worst result. In F1-Measure, MLP-IA and MLP-EIA, which introduce priori knowledge of association labels, achieve the best results.

Fig. 3. Results of the precision ratio, the recall ratio and the F1-Measure.

Compared with MLP, the precision ratio of our method is slightly reduced by the introduction of priori knowledge, but the recall ratio is greatly improved and the F1-Measure is also improved. These results indicate that the associations among labels are effectively mined based on implicit association labels and that users’ interests can be well mined. The recall ratio of MLP, in contrast, is lower than that of MLP-IA. This indicates that although MLP predicts interest labels accurately, its user profiles are incomplete and its convergence is rather slow.

In MLP-EA, the associations among explicit labels are considered. The associations are calculated by word2vec before label propagation is applied. The results show that with the introduction of explicit associations among labels, the precision ratio remains basically unchanged and the recall ratio improves, especially on larger data sets, which supports the effectiveness of mining explicit associations.

However, as the results show, our method performs better in recall ratio and F1-Measure. This indicates that associations among implicit labels mine the relationships among labels better than associations among explicit labels. In Weibo, posts are rather arbitrary, which provides more features for building user profiles but also introduces much “noise” that disturbs the results. Using priori knowledge of the associations among implicit labels avoids introducing this textual “noise”. Moreover, the semantics of posts are too diverse to mine the associations among implicit association labels well from text. Instead of considering too many explicit details, it is more beneficial to explore the associations among implicit association labels.

Furthermore, we consider both explicit and implicit association labels in MLP-EIA. The results show that our method performs similarly to MLP-EIA. This suggests that MLP-IA, by introducing implicit association labels, already captures user features well and builds a good user profile. As mentioned above, the associations among explicit labels include many features that may be positive or negative for model training, and the model cannot distinguish them well. Therefore, a user profile model can be well constructed by introducing only implicit associations among labels.

5.4 The Influence of the Social Relationship Integrity

To explore the impact of social relationship integrity on the results, we conduct additional experiments by adding more social relationships to the same user sets. In addition, we add “LIKE” and “RETWEET” relationships, which also represent interactions among users. The results are shown in Fig. 4. As the amount of known social relationship data increases, the performance of our method gradually improves, especially in recall ratio. This shows that the model can identify more interest labels when more social relationship data is available.

Fig. 4. Further results: the influence of social relationship integrity.

The hyperparameter \( \zeta \) in Eq. 4 represents the completeness of the social network. When more social relationships are added, the model works on a more complete social network graph, so \( \zeta \) approaches 1 and leads to a smaller loss. In addition, the matrix T in Eq. 6 becomes denser after more social relationships are added, and the model can capture more features among users in each iteration.

Therefore, with more relationships among users, we can build a more complete social network graph and explore more information about these interactions. The richer the social relationships are, the higher the recall ratio of the interest profile is.

5.5 Convergence

To compare the introduction of explicit labels with that of implicit association labels, we measure convergence in terms of the number of iterations and the time consumed. The results are shown in Fig. 5.

Fig. 5. The number of iterations (left) and the iteration time (right).

The results show that our method converges faster with the introduction of the priori knowledge. As the data scale increases, the time cost remains low and does not increase exponentially. In particular, our method takes the least iteration time among the compared methods. Therefore, introducing associations among implicit association labels accelerates convergence, which makes the method suitable for real-time and large-scale recommendation systems.

5.6 Summary

We compare our method with three baselines. The results show that our method accelerates the convergence of propagation and clearly increases the F1-Measure. The baseline MLP-EA, however, does not perform better in the experiments. The reason is that, in the real world, explicit associations among labels include many features that may be positive or negative for model training, and the model cannot distinguish them well. Instead of considering too many explicit details, it is more beneficial to explore the associations among implicit association labels.

Furthermore, our method achieves performance similar to that of MLP-EIA, whereas MLP-EIA takes much more time. This indicates that, although explicit associations among labels contain more features, they do not help in our model. Therefore, a user profile can be well constructed by introducing only implicit associations among labels.

In addition, we explore the impact of social relationship integrity on the results. The results show that, as the amount of known social relationship data increases, the performance of our method gradually improves and it identifies more interest labels.

6 Conclusion and Future Work

In this paper, we have studied user profiles built by multi-label propagation. We proposed an improved multi-label propagation algorithm that utilizes implicit associations among labels, which reveal more relationships among users. Experiments on six real-world Weibo data sets have shown that our method accelerates convergence and performs better than the previous methods.

Future work will focus on improving the recall ratio of our method by extending the social relationships.