1 Introduction

Collaboration between researchers, mainly expressed through co-authorship, is at the heart of most research advances: it raises productivity and enables the exchange of ideas, expertise, and resources. Previous studies (Wuchty et al. 2007; Sonnenwald 2007) confirm that collaborating authors can produce more influential research than solo authors. With the rapid development of information and communication technologies (ICT), research collaborations are no longer costly or complex, as illustrated in Lee et al. (2009) by “laboratories without wall”. However, the millions of researchers, papers, and venues in big scholarly data lead to information overload, which makes it a major challenge for researchers, especially junior ones, to find the most appropriate new coauthors outside their present research team for fresh ideas or high-impact publications. Most existing academic search systems, such as Google Scholar, CiteSeerX, and AMiner, mainly provide paper retrieval or expert recommendation, but rarely suggest personalized coauthors with whom the target author will probably connect in the future. In addition, collaborator recommendation remains a frequently studied problem within the link prediction task. Facing massive and heterogeneous scholarly data, several recommender systems have been developed to solve various problems in the academic environment (Yu et al. 2012; Chen et al. 2015; Asabere et al. 2015).

To find potential individuals for future collaboration, it is necessary to identify some factors affecting the formation of research collaboration. Most previous studies focus on calculating the vertex similarity between a researcher pair in a co-author network with some academic factors. For example, CollabSeer (Chen et al. 2011) analyzed both the structure of a co-author network and an author’s research interests for collaborator recommendation. Brandão et al. (2013) proposed to use two metrics considering the homophily and proximity principles respectively according to the researchers’ institutional affiliation and geographic location for recommending new collaborators or intensification of existing ones. MVCWalker (Xia et al. 2014) explored three academic factors, i.e., coauthor order, latest collaboration time, and times of collaboration in a random walk-based method to find the most valuable collaborators (MVCs). Chaiwanarom and Lursinsap (2015) utilized not only three common factors concerning social proximity, friendship, and complementarity skills, but also three new factors related to research interests, up-to-date publication data, and seniority of researcher to recommend collaborators under interdisciplinary environment. Subsequently, some supervised methods were proposed using multiple features extracted from link property dimension and/or semantic dimension (Al Hasan et al. 2005; Lichtenwalter et al. 2010; Yang et al. 2012).

As far as we know, the mechanisms behind link formation and the most relevant factors determining co-authorship are still unclear because of the dynamics and sparseness of the network. In the academic community, new papers and authors emerge quickly over time, signifying the appearance of new interactions in the underlying social structure. Unlike other social networks such as Facebook, where a link is removed when two persons are no longer friends, the number of nodes and edges in co-author networks rises steadily without any element being discarded. Furthermore, the co-author network is very sparse. Statistical data from DBLP (footnote 1) show that almost 80% of the 1,708,561 authors, across 3,310,725 publications up to April 2016, have fewer than 10 coauthors. Thus, finding collaborators in such a large, sparse, and dynamic network is a hard and time-consuming task.

We study the problem of collaboration recommendation from the following aspects. First, it is investigated in a heterogeneous bibliographic network containing multi-typed entities and links, which is a general data format of knowledge graphs. The heterogeneity and rich relations provide more information about the network structure and semantics, and can be of great help in mining knowledge hidden in the network, such as network evolution, link prediction, and anomaly detection. Second, many features can affect co-authorship explicitly or implicitly, for example, two authors belonging to the same institution or attending the same conference. We think these features, already discovered or still unknown, are all expressed through papers at some point. Since each link in the original network is associated with a paper node, we can regard the related paper node as an attribute of the other node to simplify the network. Then, we modify the network structure by removing the citing paper nodes and keeping the cited papers, and use collaboration time as a key factor to help the prediction. Finally, an RWR-based algorithm is proposed to measure the importance of nodes in the network, as it provides a simple way to define the “distance” from a target node a to every other node.

The rest of this paper is organized as follows: Sect. 2 introduces formally the definitions of heterogeneous bibliographic network and collaborator link prediction. Section 3 discusses details of the proposed method RWR-CR. Experiments and results among baseline methods are described in Sect. 4. Section 5 presents the related work, then Sect. 6 concludes the work.

2 Problem definition

This section presents a concise description of the heterogeneous bibliographic network, followed by the collaborator prediction problem.

2.1 Heterogeneous bibliographic network

The heterogeneous bibliographic network is built from a real-world bibliographic dataset containing rich publication information, including authors, paper titles, publication dates, venues (journal or conference), and citations. For brevity, only the words from paper titles are taken as terms, with stop words removed.

There are four types of nodes in the resulting network, i.e., authors, papers (which are split into two classes: citing papers and cited papers), terms, and venues, abbreviated as A, P, T and V respectively. Links between all kinds of nodes represent various symmetric relations, such as “write” and “written by” between authors and papers, “cite” and “cited by” between papers and papers, “contain” and “contained in” between papers and terms, “publish” and “published by” between papers and venues. Figure 1 gives a brief illustration of the network schema for a heterogeneous bibliographic network. The concept of network schema defined in Sun et al. (2011b) is a network structure description at meta level for a better understanding of the studied network, which plays a role similar to the ER (Entity–Relationship) model in database systems.

Fig. 1 Schema for heterogeneous bibliographic network

2.2 The collaborator prediction problem

Generally, the link prediction problem can be broadly formulated as follows: given a snapshot of a social network at time t and a disjoint node pair \((x,y)\), predict whether the node pair has a relationship or, in the case of dynamic interactions, will form one in the near future time \(t'=t+\varDelta t\) (Liben-Nowell and Kleinberg 2003). Given a heterogeneous bibliographic network, collaborator recommendation aims to predict with whom the query author will build a co-authorship relation in the future. It is noteworthy that we are interested in predicting new co-authors, rather than existing ones, for the target researchers using some features intrinsic to a heterogeneous bibliographic network.

To solve this problem, we first collect a raw dataset over a past time interval \(T_{\textit{data}}=[t_{\textit{start}},t_{\textit{end}}]\) and split it into two parts: a first time interval \(T_{\textit{first}}=[t_{\textit{start}},t_{\textit{current}}]\) and a second time interval \(T_{\textit{second}}=(t_{\textit{current}},t_{\textit{end}}]\). The data in the former part is used to predict the potential links in \(T_{\textit{second}}\), while the latter part, containing the ground truth, is used to validate the effectiveness of the prediction. We assume that \(T_{\textit{second}}\) is unknown to the authors in \(T_{\textit{first}}\) and set the last year of \(T_{\textit{first}}\) as the current year \(t_{\textit{current}}\). Next, we construct a heterogeneous bibliographic network based on the prior interval data and apply modification rules to simplify the network structure. The succeeding interval data, by contrast, only needs to be organized in a homogeneous manner, because only the co-authorship among authors in that network is of concern. Then, with the topological structure and the rich entity information, we can define several importance measures to capture the importance of a node in the modified network, which can also be used to determine the weight of each link. After computing the relevance score of each link, we apply the proposed RWR-based approach to the heterogeneous network to generate an ordered recommendation list for a target node.

It is worth noting that the goal of this paper is to recommend potential collaborators to a researcher. Therefore, it is necessary to know the various representations of co-authorship. Here, we use the concept of meta path depicted in Sect. 3.1 and develop modification rules over them.

3 Proposed method

In this section, we will introduce a method named RWR-CR (standing for Random Walk with Restart-based Collaborator Recommendation) to solve the collaborator recommendation problem. This method includes three steps: (1) heterogeneous bibliographic network construction; (2) edge weighting; (3) random walk with restart.

3.1 Heterogeneous bibliographic network construction

The collaborator prediction problem can be considered as choosing the most appropriate paths connecting the target researcher with candidate ones. Observing the network schema, we can see that the paper node plays a central role in interconnecting the network, and that no path leads from one author to another without passing through a paper node. This inspires us to simplify the co-authorship relation paths by treating paper nodes as attributes of the related nodes and then removing them, resulting in a much clearer network structure. Besides, from the perspective of random walk, it can be seen that: (1) by random walk with restart, the score of a node generally decreases as it gets farther away from the start node; (2) paths including paper nodes can be long; (3) from (1) and (2), the score of an author node can be low if it is connected to the start node via paper nodes; (4) however, the score should not be low in such a case due to the importance of paper nodes; and (5) thus, paper nodes should be removed. On closer observation, the paper node is divided into citing paper and cited paper, since there are “cite” and “cited by” relations between papers in Fig. 1. What we discuss above concerns citing papers, so the citing papers are removed and the cited papers are kept in the final network using the modification rules defined as follows.

Following the work of Sun et al. (2011a, b), we consider paths following different meta paths with length less than 5. A meta path is a path defined on the network schema, which denotes a concatenated relation between two nodes. Intuitively, different meta paths can capture different semantics in a heterogeneous network. For example, the meta path \(A{\longrightarrow} P{\longrightarrow} A'\) describes the co-author relation between two authors A and \(A'\), which may be abbreviated as AP\(A'\) when there is no ambiguity about the meaning of the path. Furthermore, the longer the path, the looser the connection between its two ends. Under the constraint of a maximally allowed path length, we then have 10 topology paths with an author node at each end. All the paths are categorized into two types: direct links and indirect links, depending on whether two authors are connected to each other directly or through other types of nodes.

Furthermore, we define the modification rules on the aforementioned 10 meta paths for path simplification. The simplification focuses on reducing the number of nodes in the network by deleting the unnecessary citing paper nodes. In the following, we use P, \(P'\), and \(P''\) to distinguish three different papers, and likewise A, \(A'\), and \(A''\) to distinguish three different authors.

  (1) AP\(A'\) converts to A\(A'\). The former meta path means two authors have co-authored the same paper. Since the paper is the same for the two authors, we can link them directly without an intermediate node, as in the example illustrated in Fig. 2a. The modified path gives a much more intuitive view of co-authorship.

  (2) AP\(A'\)\(P'\)\(A''\) converts to A\(A'\)\(A''\). The former meta path means that the left author A and the right author \(A''\) are co-authors of the same author \(A'\). Using rule (1), the path can be modified to the latter one, which is much more concise and intuitive. An example is given in Fig. 2a to illustrate that Andy and Jim are both co-authors of Carl.

  (3) A\(P{\longrightarrow} P'\)\(A'\) converts to A\(P'\)\(A'\). The former meta path means the left author A cites the paper written by the right author \(A'\). Likewise, we revise the path by ignoring the citing paper and keeping the cited paper. Therefore, the path Andy-P2-Jim appears in Fig. 2b. It is worth noting that we only focus on the cited papers rather than all the papers. The reasons for this choice are twofold: the number of cited papers and the bridging role they play.

  (4) A\(P\longleftarrow P'\)\(A'\) converts to AP\(A'\). The former path can be seen as the reverse of the former one in rule (3). Similar to the prior rule, it can be converted by removing the citing paper \(P'\). In addition, only one rule is applied to each path rather than using multiple rules iteratively. For example, when A\(P{\longrightarrow} P'\)\(A'\) converts to A\(P'\)\(A'\), the result will not be converted further via rule (1) to A\(A'\).

  (5) APV\(P'\)\(A'\) converts to AV\(A'\). The former meta path means that two authors publish papers in the same venue. According to the semantic meaning of the path, the venue node is a very important node connecting the two authors, and it is possible to build a new path by excluding the two paper nodes. Accordingly, Fig. 2c shows the new path Andy-KDD-Jim.

  (6) APT\(P'\)\(A'\) converts to AT\(A'\). The former meta path means two authors write different papers with the same term(s). Equally, we change the path into the latter form with two paper nodes removed, as seen in Fig. 2d.

  (7) A\(P{\longrightarrow} P'{\longrightarrow} P''\)\(A''\). This path means that author A cites a paper (\(P'\)) that cites a paper (\(P''\)) of author \(A''\). If we directly removed the citing paper P or \(P'\) along the path, we would lose significant author information of the removed node \(P'\). So, we decompose it into two parts: A\(P{\longrightarrow} P'\)\(A'\) and \(A'\)\(P'{\longrightarrow} P''\)\(A''\), and revise both via rule (3).

  (8) A\(P{\longrightarrow} P'\longleftarrow P''\)\(A''\). This path means that two authors (A and \(A''\)) cite the same paper (\(P'\)). Similar to the above rule, it decomposes into two parts: A\(P{\longrightarrow} P'\)\(A'\) and \(A'\)\(P'\longleftarrow P''\)\(A''\), which are modified via rule (3) and rule (4) respectively.

  (9) A\(P\longleftarrow P'{\longrightarrow} P''\)\(A''\). This path means that two authors (A and \(A''\)) are cited by the same paper (\(P'\)). We decompose it into two parts: A\(P\longleftarrow P'\)\(A'\) and \(A'\)\(P'{\longrightarrow} P''\)\(A''\), and revise them according to rule (4) and rule (3) respectively.

  (10) A\(P\longleftarrow P'\longleftarrow P''\)\(A''\). This path means that author A is cited by the paper (\(P'\)) which is cited by the right author’s paper (\(P''\)). Again, we decompose it into two parts: A\(P\longleftarrow P'\)\(A'\) and \(A'\)\(P'\longleftarrow P''\)\(A''\), and revise both of them by rule (4).

Fig. 2 Rules for path simplification. a Meta path AP\(A'\) and an example. b Meta path A\(P{\longrightarrow} P'\)\(A'\) and an example. c Meta path APV\(P'\)\(A'\) and an example. d Meta path APT\(P'\)\(A'\) and an example

Given the above rules, the bibliographic network can then be defined as an undirected graph \(G=(V,E)\), comprising a vertex set V and an edge set E. Here \(V=V_a\cup V_p \cup V_t \cup V_v\) is the set of various entities in the network, where \(V_a\) is the author node set, \(V_p\) is the cited paper node set, \(V_t\) is the term node set, and \(V_v\) is the venue node set. The nodes in set \(V_a\) are called author nodes, and those in set \(V_p \cup V_t \cup V_v\) are referred to as attribute nodes. Each edge \(e\in E=\{(A,A),(A,P),(A,T),(A,V)\}\) in graph G denotes a certain relation between an author and a node of any type. Meanwhile, every node has a time attribute corresponding to the related papers’ years of publication, e.g., Jim would record “2015” twice if he wrote two papers in that year.
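The construction can be illustrated with a minimal Python sketch, assuming each paper is a dictionary with the hypothetical keys `id`, `authors`, `year`, `venue`, `terms`, and `refs` (none of these names come from the original dataset). It applies rules (1), (3), (4), (5), and (6) directly, stores the list of publication years on every edge, and creates paper nodes only for cited papers. Using the cited paper's own year for the link to its authors is an assumption of this sketch.

```python
from collections import defaultdict
from itertools import combinations


def build_network(papers):
    """Return {(u, v): [years]} for the simplified graph.

    Nodes are (type, name) tuples with type "A", "P", "T" or "V".
    Every undirected edge is stored under both orientations.
    """
    by_id = {p["id"]: p for p in papers}
    edges = defaultdict(list)

    def link(u, v, year):
        edges[(u, v)].append(year)
        edges[(v, u)].append(year)

    cited_ids = set()
    for p in papers:
        authors = sorted(set(p["authors"]))
        year = p["year"]
        # Rule (1): co-authors of one paper are linked directly.
        for a, b in combinations(authors, 2):
            link(("A", a), ("A", b), year)
        for a in authors:
            # Rule (5): author - venue, the paper node is dropped.
            link(("A", a), ("V", p["venue"]), year)
            # Rule (6): author - term, the paper node is dropped.
            for term in set(p["terms"]):
                link(("A", a), ("T", term), year)
            # Rule (3): citing author - cited paper, the citing paper is dropped.
            for ref in set(p["refs"]):
                link(("A", a), ("P", ref), year)
        cited_ids.update(p["refs"])

    # Rule (4): a cited paper stays linked to its own authors
    # (assumption: the edge carries the cited paper's publication year).
    for pid in cited_ids:
        cited = by_id.get(pid)
        if cited is None:
            continue  # cited paper lies outside the dataset
        for a in set(cited["authors"]):
            link(("A", a), ("P", pid), cited["year"])
    return dict(edges)
```

Papers that are never cited do not become nodes in this sketch; they only contribute years to the author, term, and venue edges.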

3.2 Edge weighting

The model RWR-CR is inspired by the Least Recently Used (LRU) page replacement algorithm in computer operating systems, in the spirit that pages heavily used in the past few instructions are also most likely to be used heavily in the next few instructions. In other words, the recent past is more valuable than the distant past in predicting the future. We extend this idea to an academic scenario, where two researchers who co-authored repeatedly within a short time interval, or who collaborated recently, are likely to co-operate again in the near future. Therefore, the weight assigned to each link connecting an author and a related node can be calculated from two components: how close together the last two collaborations occurred, and how far the most recent collaboration is from now. Specifically, two importance measures are employed to evaluate the importance of other nodes to the target node, and the weight of each link in the network is calculated from them accordingly.

(1) Sequence Importance Measure. It is valuable to measure how often author a has a relationship (i.e., write, cite, contain or publish) with node x. For example, the relation between an author and an item (e.g., venue, term, or paper) is closer if the author uses this item twice in one year than if the author uses it twice in 5 years. The sequence importance measure \(S(a,x)\) of a node \(x\in V=V_a\cup V_p\cup V_t \cup V_v\) relative to an author node a is defined as

$$\begin{aligned} S(a,x)=\frac{1}{\phi (P_x^a)+1}, \end{aligned}$$
(1)

where \(P_x^a\) is the sequence, in descending order, of the publication years of the papers by a that are related to node x, with \((a,x) \in E\) and \(x \in V\). The function \(\phi (P_x^a)=P_x^{a(0)}-P_x^{a(1)}\) returns the difference between the two largest years (\(P_x^{a(0)}\) and \(P_x^{a(1)}\)) in the year sequence \(P_x^a\). For example, if the author a has published three papers containing the term x in 2005, 2008, and 2010 respectively, then \(P_x^a=\{2010,2008,2005\}\) and \(\phi (P_x^a)=2\). Since the length of \(P_x^a\) may be 1, i.e., the node pair \((a,x)\) is related only once, we then take the year \(t_{\textit{start}}\) in which the dataset starts as the second year (in this paper, 2001 for DBLP, and 1996 for hep-th). We add 1 to the denominator to avoid division by zero. It is reasonable to use the recent collaborations rather than the whole year sequence because researchers may change their institutional affiliations or research interests. Researchers may stop collaborating with their supervisors after graduation, and they are also likely to establish collaborations with unfamiliar researchers on a new topic. So, the recent interactions of a node pair are our main concern.
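As a worked illustration of Eq. (1), the following Python sketch computes \(S(a,x)\) from the list of years attached to a node pair. The function name and argument names are illustrative assumptions; the fallback to \(t_{\textit{start}}\) for a single-year sequence follows the description above.

```python
def sequence_importance(years, t_start):
    """S(a, x) = 1 / (phi + 1), where phi is the gap between the
    two most recent years in which a and x were related."""
    if not years:
        raise ValueError("the node pair must be related at least once")
    ordered = sorted(years, reverse=True)
    latest = ordered[0]
    # With a single year, the dataset start year acts as the second year.
    second = ordered[1] if len(ordered) > 1 else t_start
    return 1.0 / ((latest - second) + 1)
```

For the example sequence 2005, 2008, 2010 the gap is 2, so the measure is 1/3; two relations in the same year give the maximum value of 1.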

Note that, for each node pair \((a,x)\), the measure \(S(a,x)\) can be interpreted differently according to the type of x, which can be an author, a venue, a cited paper, or a term. If \(x \in V_a\), the measure \(S(a,x)\) is calculated with the years in which the two authors co-authored. If \(x \in V_p\), \(S(a,x)\) measures the time between the last two citations of paper x by author a. If \(x \in V_v\), it measures the time between the last two papers author a published at venue x.

This sequence importance measure captures the importance of any node x relative to author a through the time difference between the last two relations established. Furthermore, the weights assigned to each link bias the subsequent random walk process so that it traverses the most relevant nodes more easily. Intuitively, the longer the duration, the looser the relationship of the node pair \((a,x)\), which implies a smaller effect of node x on the behavior of node a. It also seems reasonable to infer that author a is then less productive and has a lower probability of publishing new papers in the future.

(2) Freshness Importance Measure. Aside from the duration measure, we derive a freshness importance for a node pair \((a,x)\) based on the freshness of their most recent paper with respect to the current year \(t_{\textit{current}}\) (here, 2010 for DBLP, and 2000 for hep-th). The freshness importance measure of node \(x \in V\) relative to an author a is

$$\begin{aligned} F(a,x)=\frac{1}{t_{current}-P_x^{a(0)}+1}, \end{aligned}$$
(2)

where \(t_{\textit{current}}\) is the year in which the \(T_{\textit{first}}\) interval ends, and \(P_x^{a(0)}\) is the largest year in the year sequence \(P_x^a\), namely the latest year.

According to the definition, the more recently the relationship between nodes a and x occurs, the larger the score \(F(a,x)\) and the higher the chance that they co-operate again. For example, if an author a and another author x co-authored their latest paper in 2008 and \(t_{\textit{current}}=2010\), then the score is 0.333. Likewise, for a node \(x \in V_t \cup V_v \cup V_p\), the measure \(F(a,x)\) gives the freshness of the last paper authored by a containing the term x or published in the venue x, or the freshness of the last time author a cited paper x, respectively.

Given a bibliographic network \(G=(V,E)\), we now describe how to assign weights to each link based on the measures discussed above. First, the definition of aggregate relative importance scores of a node pair is given as follows:

$$\begin{aligned} w(a,x)=\alpha S(a,x)+(1-\alpha ) F(a,x). \end{aligned}$$
(3)

It is clear that the score is a combination of the above two measures, whose contribution to the total score is adjusted by a parameter \(0\le \alpha \le 1\). Specifically, when \(x \in V_p\) is a cited paper, it is more complicated to distinguish three citation scenarios. If an author a wrote a paper x and cited it, then \(F(a,x)=1\). If an author a wrote a paper x but never cited it, then \(F(a,x)=0.5\). If an author a cited a paper x written by others, we use the definition in Eq. (3).
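Eqs. (2) and (3) can be sketched in Python as follows. The function and parameter names are illustrative assumptions, and the sketch covers only the general case: the special values of \(F(a,x)\) for papers written by a itself are not handled here.

```python
def freshness_importance(years, t_current):
    """F(a, x) = 1 / (t_current - latest year + 1), Eq. (2)."""
    return 1.0 / (t_current - max(years) + 1)


def edge_weight(years, t_start, t_current, alpha):
    """w(a, x) = alpha * S(a, x) + (1 - alpha) * F(a, x), Eq. (3)."""
    ordered = sorted(years, reverse=True)
    # Single-year sequences fall back to the dataset start year, as in Eq. (1).
    second = ordered[1] if len(ordered) > 1 else t_start
    s = 1.0 / ((ordered[0] - second) + 1)
    f = freshness_importance(years, t_current)
    return alpha * s + (1.0 - alpha) * f
```

With years 2005, 2008, 2010, \(t_{\textit{current}}=2010\), and \(\alpha =0.5\), this gives \(0.5\cdot \frac{1}{3}+0.5\cdot 1\approx 0.667\).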

Then, a further normalization process is performed to each edge in the network, which is defined as follows:

$$\begin{aligned} M(a,x)= &{} {\left\{ \begin{array}{lll} \frac{\lambda w(a,x)}{\sum _{x' \in N_t (a)} w(a,x')} &{} \quad if \ |N_t (a)|> 0 \ and \ |N_u(a)|>0\\ \frac{w(a,x)}{\sum _{x' \in N_t (a)} w(a,x')} &{} \quad if \ |N_t (a)|>0 \ and \ |N_u(a)|=0\\ 0 &{} \quad otherwise, \end{array}\right. } \end{aligned}$$
(4)
$$\begin{aligned} M(a,a')= & {} {\left\{ \begin{array}{ll} \frac{(1-\lambda )w(a,a')}{\sum _{a'' \in N_u(a)} w(a,a'')} &{} \quad if\;|N_u(a)|>0 \;and\;|N_t(a)|>0\\ \frac{w(a,a')}{\sum _{a'' \in N_u(a)} w(a,a'')} &{} \quad if\;|N_u(a)|>0\;and\;|N_t(a)|=0\\ 0 &{} \quad otherwise, \end{array}\right. } \end{aligned}$$
(5)
$$\begin{aligned} M(x,a)= & {} \frac{w(x,a)}{\sum _{y \in N(x)} w(x,y)}, \end{aligned}$$
(6)

where nodes a and \(a'\) are author nodes, and node x is an attribute node. \(N_u(a)\) is the set of author nodes connected to author a, \(N_t(a)\) is the set of attribute nodes connected to author a, and N(x) is the set of all nodes connected to node x regardless of node type. The parameter \(\lambda \) controls the extent to which direct links, as opposed to indirect links, contribute to the final performance.
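A minimal Python sketch of the normalization in Eqs. (4) to (6) is given below. It assumes the raw weights of one node are available as dictionaries mapping neighbor labels to positive values of w; the function names and this data layout are assumptions made for illustration.

```python
def transition_from_author(w_attr, w_auth, lam):
    """Normalize the outgoing weights of one author a.

    w_attr: {attribute node: w(a, x)}, w_auth: {author node: w(a, a')}.
    Attribute links share lam and author links share 1 - lam when both
    kinds of neighbors exist (Eqs. 4 and 5); otherwise the existing kind
    receives the full probability mass.
    """
    attr_share = lam if w_auth else 1.0
    auth_share = (1.0 - lam) if w_attr else 1.0
    sum_attr = sum(w_attr.values())
    sum_auth = sum(w_auth.values())
    m = {}
    for x, w in w_attr.items():
        m[x] = attr_share * w / sum_attr
    for a2, w in w_auth.items():
        m[a2] = auth_share * w / sum_auth
    return m


def transition_from_attribute(w_nbrs):
    """Normalize the outgoing weights of one attribute node x (Eq. 6)."""
    total = sum(w_nbrs.values())
    return {a: w / total for a, w in w_nbrs.items()}
```

In every non-empty case the outgoing probabilities of a node sum to 1, which is what the random walk in the next subsection relies on.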

3.3 Random walk with restart

To predict potential collaborators well, we propose a method based on Random Walk with Restart, named RWR-CR. RWR provides a good proximity score, defined as a steady-state probability between two nodes in a weighted graph, and it has been successfully used in numerous applications. There are two reasons behind the success of RWR: (1) Since it is a useful mathematical framework that systematically provides the importance of each node in a network, RWR has received increasing interest from both applied and theoretical studies. Moreover, many existing graph-based collaborator recommendation systems were built upon the RWR algorithm and proved to be very effective. (2) RWR is a model that utilizes node and structure information simultaneously without losing information, and the random walk process may be optimized by guiding it toward the more relevant nodes in the network.

Given a heterogeneous bibliographic network, a random walker’s behavior can be formalized as:

  • The walker starts to surf from author a.

  • With the probability c, the walker moves to a random neighbor x according to the edge weight \(M(a,x)\). The parameter c is a damping factor introduced to control the probability of the walker moving ahead or jumping back. The score \(M(a,x)\) represents the chance of walking from a to x, similar to an element \(m_{ax}\) in the transition matrix of the PageRank algorithm, and is defined as the ratio of the link weight \(w(a,x)\) to the total weight of the edges from a to nodes of the same type as x (i.e., attribute node or author node). In addition, we use the parameter \(\lambda \) to control how much we rely on direct links or indirect links, as illustrated in Eqs. (4) and (5).

  • With the probability \(1-c\), the walker returns to a.

  • When at an attribute node, the walker can only move to an author node through indirect links with the score \(M(x,a)\) defined in Eq. (6).

  • After repeating the steps many times, the probability of the walker arriving at any node x in the network converges to a stationary weight indicating the importance of x to a.

Finally, we define the random walk process to calculate the link relevance for a target author \(a^\star \) as follows:

$$\begin{aligned}&\begin{array}{rl} r_a^{(t)}&=c\left( \sum \nolimits _{y\in N(a)}M(y,a)r_y^{(t-1)}\right) +(1-c)r_a^{(0)}\\ &=c\sum \nolimits _{a'\in N_u(a)}M(a',a)r_{a'}^{(t-1)}\\ &\quad +\,c\sum \nolimits _{x'\in N_t(a)}M(x',a)r_{x'}^{(t-1)}+(1-c)r_a^{(0)}, \end{array} \end{aligned}$$
(7)
$$\begin{aligned}&\begin{array}{rl} r_x^{(t)}&= c\sum \nolimits _{y\in N(x)}M(y,x)r_y^{(t-1)}\\ &=c\sum \nolimits _{a'\in N_u(x)}M(a',x)r_{a'}^{(t-1)}. \end{array} \end{aligned}$$
(8)

Here \(r_a^{(t)}\) represents the rank score of node a after the \(t^{th}\) iteration, which is the quantized importance or the probability of author \(a^\star \) to author a, while \(r_x^{(t)}\) is the random walk probability from author \(a^\star \) to attribute node x at iteration t. The initial rank score vector \(r^{(0)}\) assigns 1 to the target node and 0 to the remaining elements. In other words, \(r_a^{(0)}=1\) if \(a=a^\star \), and \(r_a^{(0)}=0\) otherwise.
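The iteration in Eqs. (7) and (8) can be sketched in plain Python as below. The sketch assumes the normalized transition scores are stored as a nested dictionary `M[y][n]` holding \(M(y,n)\); the default value of c, the tolerance, and the iteration cap are illustrative assumptions, not values taken from this paper.

```python
def rwr(M, target, c=0.8, tol=1e-10, max_iter=1000):
    """Random walk with restart from `target`.

    M maps each node y to {neighbor n: M(y, n)}. With probability c the
    walker follows an edge, with probability 1 - c it returns to target.
    """
    nodes = set(M) | {n for nbrs in M.values() for n in nbrs}
    r0 = {n: 1.0 if n == target else 0.0 for n in nodes}
    r = dict(r0)
    for _ in range(max_iter):
        # Restart term: non-zero only for the target author.
        nxt = {n: (1.0 - c) * r0[n] for n in nodes}
        # Propagation term: c * sum over y of M(y, n) * r_y.
        for y, nbrs in M.items():
            for n, prob in nbrs.items():
                nxt[n] += c * prob * r[y]
        delta = sum(abs(nxt[n] - r[n]) for n in nodes)
        r = nxt
        if delta < tol:
            break
    return r
```

Because the restart vector is non-zero only at the target, attribute nodes receive just the propagation term, which matches Eq. (8).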

The proposed algorithm, RWR-CR, is depicted as follows.

Algorithm 1   Collaborator recommendation using random walk with restart

Input:

A bibliographic dataset containing authors, paper titles, year of publication, and venues during \(T_{\textit{first}}\); target author \(a^\star \); three parameters \(\alpha ,\lambda ,\) and c;

Step 1:

Data cleaning. Extract all types of nodes defined above, filter out the isolated authors and keep the authors who have been continuing to write papers, and construct a heterogeneous bibliographic network \(G=(V,E)\) based on the modification rules described above from the original dataset;

Step 2:

Edge weighting. Calculate the weight of each edge in G using the sequence importance and freshness importance measures along with the parameters \(\alpha \) and \(\lambda \);

Step 3:

Parameter initialization. Create a rank score vector r and set all its elements to be 0 except setting the element \(a^\star \) to 1;

Step 4:

Random walk. Start the random walk from node \(a^\star \), and update r iteratively until convergence;

Step 5:

Recommendation. Sort the final score list \(r^{(t)}\) in decreasing order, select the top N author nodes \(a'\) with \(a'\notin N_u(a^\star )\) as the recommended candidates for \(a^\star \);

Output:

The top N recommended potential co-authors for author \(a^\star \).
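Step 5 can be illustrated with a short Python sketch. It assumes nodes are labeled as `(type, name)` tuples with type `"A"` for authors, and that the existing co-authors \(N_u(a^\star )\) are passed in as a set; both conventions are assumptions of this sketch.

```python
def recommend(scores, target, coauthors, n=10):
    """Return the top-n author nodes that are neither the target
    nor already co-authors of the target, ordered by rank score."""
    candidates = [
        (node, score)
        for node, score in scores.items()
        if node[0] == "A" and node != target and node not in coauthors
    ]
    # Sort by decreasing score; break ties by node label for determinism.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [node for node, _ in candidates[:n]]
```

Attribute nodes also receive scores during the walk, so they are filtered out before ranking.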

4 Experimental results

In this section, we compare our collaborator prediction approach with eight state-of-the-art link prediction methods on two bibliographic networks. We then analyze and discuss the results.

The following experiments are performed on a laptop with 64-bit Windows 7 operating system, Intel i7-3540M CPU @3.00 GHz, and 8 GB memory. All the programs are implemented with Python.

4.1 Dataset

We use two real datasets to demonstrate the effectiveness of our method. The first dataset consists of two parts: the DBLP-citation network (footnote 2) provided by Tang et al. (2008), which contains citation relationships, is used for \(T_{\textit{first}}=[2001,2010]\), and the DBLP network (footnote 3), which lacks citations, is used for the \(T_{\textit{second}}=[2011,2015]\) interval. The publication data collected from the two parts cover 20 conferences across four fields (database, data mining, artificial intelligence, information retrieval) during \(T_{\textit{data}}=[2001,2015]\). After data cleaning and removing single-author papers, we get 4876 concurrent authors appearing in both intervals from 28,311 publications. Then, a modified heterogeneous graph containing 4786 terms, 28,511 cited papers, and 20 venues is built. Attribute nodes with degree one are not counted. Although we only use a subset of the whole DBLP due to hardware constraints, it suffices to represent various practical co-authorship scenarios, since authors from four fields and 20 conferences not only have varied research interests and research issues, but may also have interdisciplinary collaborations. Moreover, sampling is common practice in many studies.

The second dataset is hep-th (Theoretical High Energy Particle Physics), a portion of arXiv provided by KDD Cup 2003 (footnote 4). After the same data preprocessing as above, we get 1922 concurrent authors from 21,450 publications. Next, a modified heterogeneous network containing 2442 terms, 12,431 cited papers, and 70 journals is constructed. This dataset keeps all the journals except anomalous ones and those with very small degree.

All collaborative authors in the datasets can be divided into four categories in terms of publishing lifecycle: newcomers who have just started to write co-authored papers, temporaries who publish only a few collaborative papers and then quit, continuants who co-author papers over a long period of time, and terminators who have stopped publishing collaboratively (Bukvova 2010). In this paper, we prefer continuants as the target nodes for two reasons: one is their high likelihood of publishing collaborative papers in the future, the other is their role as intermediaries connecting the remaining three categories.

During the data preparation phase, we discarded authors who do not have co-authored papers in both intervals, as they are information islands far from the major authors in the network and would cause trouble in the subsequent random walk process; removing them keeps the author nodes more reachable from each other in the graph. Note that, in the original dataset, not every paper has a complete reference list and not every cited paper was published in the selected venues or time interval; this results in inevitable information loss, and such missing items are ignored. The statistics of the datasets are shown in Table 1. The data in \(T_{\textit{second}}\) is used to validate the predictions made from \(T_{\textit{first}}\), so it is a homogeneous network with only one type of node (authors) and one type of link (co-authorship). Furthermore, we split the authors of \(T_{\textit{first}}\) into halves: one half for parameter selection, the other for testing our method.

Table 1 Statistics of the experimental datasets

4.2 Evaluation metrics

To quantitatively evaluate the proposed method, for all target nodes we use historic collaboration data (\(T_{\textit{first}}\)) for prediction and the following years (\(T_{\textit{second}}\)) for validation. Since there is a strong relationship between collaboration and productivity, authors will keep collaborating, either with previous co-authors or with new ones. In evaluation, we consider those candidates who have never co-authored with the target node in \(T_{\textit{first}}\), and our task is to predict whether they will collaborate later. If the system recommends a collaborator and the relationship is actually built, we say the system makes a correct recommendation; otherwise, it makes a wrong recommendation. Based on this, we evaluate the quality of the collaborator recommendations for both the proposed and the baseline algorithms in terms of the following two information retrieval metrics.
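The labelling step described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the edge-list representation and the names `new_collaborators` and `relevance_vector` are assumptions introduced here.

```python
def neighbors(edges, node):
    """Co-authors of `node` in an undirected co-authorship edge list."""
    found = set()
    for a, b in edges:
        if a == node:
            found.add(b)
        elif b == node:
            found.add(a)
    return found


def new_collaborators(first_edges, second_edges, target):
    """Authors who co-author with `target` in T_second but never in T_first."""
    return neighbors(second_edges, target) - neighbors(first_edges, target)


def relevance_vector(ranked_candidates, relevant):
    """1 for a correct recommendation, 0 for a wrong one, in ranking order."""
    return [1 if c in relevant else 0 for c in ranked_candidates]


# Toy data (hypothetical): co-authorship pairs in the two intervals.
first_edges = [("a", "b"), ("a", "c"), ("b", "d")]
second_edges = [("a", "b"), ("a", "d"), ("a", "e")]

relevant = new_collaborators(first_edges, second_edges, "a")
rel = relevance_vector(["d", "f", "e"], relevant)
```

The resulting binary list `rel` is the input expected by ranking metrics such as RBP and RR.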

(1) RBP (Rank-Biased Precision) RBP (Moffat and Zobel 2008) is an evaluation metric for ranked retrieval that is suitable for evaluation with incomplete relevance data (for more information see Sakai and Kando 2008; Sakai 2014). It assumes that the user always starts by examining the top-ranked item of the list, progresses from one item to the next with a (persistence) probability p, and, conversely, ends the examination of the ranking at any given point with probability \((1-p)\). Each termination decision is independent of the current depth reached, of previous decisions, and of whether the item just examined was relevant. For a list of length d, the rank-biased precision metric is defined as \(RBP=(1-p)\sum _{i=1}^d {rel_i \cdot p^{i-1}}\), where \(rel_i \in \{0,1\}\) is the relevance judgement of item i in the ranking, and the \((1-p)\) factor scales the RBP to the range [0,1]. In other words, the metric assigns relevance weights according to a geometric distribution with parameter p: a smaller p, corresponding to impatient users, places greater emphasis on items that appear early in the ranking, whereas a larger p, corresponding to patient users, spreads the weight further down the ranking. In both cases, all items in the ranking contribute to the final score.
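As a minimal sketch of the RBP definition above, assuming a binary relevance list ordered by rank (the function name `rank_biased_precision` is introduced here for illustration):

```python
def rank_biased_precision(rel, p):
    """RBP = (1 - p) * sum_{i=1..d} rel_i * p^(i-1) for binary relevance `rel`."""
    if not 0.0 <= p < 1.0:
        raise ValueError("persistence p must lie in [0, 1)")
    # enumerate() starts at 0, which matches the exponent i - 1 for rank i.
    return (1.0 - p) * sum(r * p ** k for k, r in enumerate(rel))


# Relevant items at ranks 1 and 3, impatient user (p = 0.5):
# 0.5 * (1 + 0 + 0.25) = 0.625
rbp_example = rank_biased_precision([1, 0, 1], 0.5)
```

With the same list and a more patient user, e.g. `p = 0.95`, the item at rank 3 contributes almost as much as the item at rank 1, but the \((1-p)\) scaling makes the overall score smaller for such a short list.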

(2) RR (Reciprocal Rank) RR can be thought of as a scoring regime with a tractable user model. It assumes that items are examined starting with the first, until a relevant one is found. Thus, for a ranked list that does not contain a relevant item, let \(RR=0\). Otherwise, let \(rank_i\) be the rank of the first relevant item in the ranked list, then \(RR=1/rank_i\). For example, if the first relevant item appears at rank 1, then \(RR=1\). If it is at rank 10, then \(RR=1/10\). In other words, the user this measure models has a very strong preference for a relevant item at rank 1.
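The RR definition can be sketched in the same style; the function name `reciprocal_rank` is an illustrative assumption.

```python
def reciprocal_rank(rel):
    """1 / rank of the first relevant item, or 0 if the list has none."""
    for rank, r in enumerate(rel, start=1):
        if r:
            return 1.0 / rank
    return 0.0


rr_first = reciprocal_rank([1, 0, 0])        # relevant item at rank 1
rr_tenth = reciprocal_rank([0] * 9 + [1])    # relevant item at rank 10
rr_none = reciprocal_rank([0, 0, 0])         # no relevant item
```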

4.3 Baseline methods

We compare eight methods for potential collaborator recommendation. The first four are common traditional methods based on similarity of node neighborhoods.

(1) Common Neighbors (CN) \(S_{x,y}^{CN}=|\varGamma (x)\cap \varGamma (y)|\), where \(\varGamma (x)\) denotes the neighbor set of node x. The more neighbors two nodes have in common, the more likely they are to link to each other.

(2) Jaccard’s Coefficient \(S_{x,y}^{Jaccard}=\frac{|\varGamma (x)\cap \varGamma (y)|}{|\varGamma (x)\cup \varGamma (y)|}\). It is a normalization of common neighbors, and measures the probability with which a common neighbor of author x and y is randomly selected from \(\varGamma (x)\cup \varGamma (y)\).

(3) Adamic/Adar (AA) \(S_{x,y}^{AA}=\sum _{z\in \varGamma (x)\cap \varGamma (y)}\frac{1}{\log {|\varGamma (z)|}}\). The weighting scheme is the inverse log degree of each common neighbor of nodes x and y, which refines the counting of common neighbors by assigning more weight to less-connected neighbors and less weight to more-connected ones.

(4) Resource Allocation (RA) \(S_{x,y}^{RA}=\sum _{z\in \varGamma (x)\cap \varGamma (y)}\frac{1}{|\varGamma (z)|}\). AA and RA are clearly similar, differing only in how strongly they penalize high-degree common neighbors: RA's \(|\varGamma (z)|^{-1}\) penalizes them more heavily than AA's \((\log {|\varGamma (z)|})^{-1}\).
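The four neighborhood-based scores above can be computed together from neighbor sets. The following is a minimal sketch under the assumption that the co-author network is stored as a dictionary mapping each node to its neighbor set; the name `similarity_scores` and the toy graph are illustrative only.

```python
import math


def similarity_scores(adj, x, y):
    """CN, Jaccard, AA and RA scores for the node pair (x, y)."""
    common = adj[x] & adj[y]
    union = adj[x] | adj[y]
    cn = len(common)
    jaccard = cn / len(union) if union else 0.0
    # A common neighbor of two distinct nodes has degree >= 2, so log > 0;
    # the guard only protects against degenerate input.
    aa = sum(1.0 / math.log(len(adj[z])) for z in common if len(adj[z]) > 1)
    ra = sum(1.0 / len(adj[z]) for z in common)
    return {"CN": cn, "Jaccard": jaccard, "AA": aa, "RA": ra}


# Toy undirected graph with edges x-a, x-b, y-a, y-b, y-c, b-c.
adj = {
    "x": {"a", "b"},
    "y": {"a", "b", "c"},
    "a": {"x", "y"},
    "b": {"x", "y", "c"},
    "c": {"y", "b"},
}
scores = similarity_scores(adj, "x", "y")
```

Here the common neighbors of x and y are a (degree 2) and b (degree 3), so RA gives 1/2 + 1/3 while AA gives 1/log 2 + 1/log 3, which shows that RA discounts the higher-degree neighbor more strongly.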

In addition, we use four Random Walk with Restart (RWR) variants as baseline methods to compare with our proposed method.

(5) RWR-Homogeneous This is a basic RWR on a homogeneous network (i.e., a co-author network here). The homogeneous network has only one type of nodes (authors) and one type of links (co-authorship).

(6) RWR-Heterogeneous This is a basic RWR on the original heterogeneous bibliographic network with every link weighted uniformly. In other words, a random walk process runs on a more complicated network with heterogeneous nodes and links.

(7) MVCWalker This method, proposed in Xia et al. (2014), is a random walk with restart that exploits three academic factors, i.e., coauthor order, latest collaboration time, and times of collaboration, to define link importance in a homogeneous co-author network for finding the most valuable collaborators (MVCs).

(8) RWR-LP The method proposed in Lee and Adorna (2012), referred to as RWR-LP, implements an RWR-based algorithm on a modified network obtained by altering an existing heterogeneous bibliographic network to highlight the relations, and employs four importance measures to bias the weights of links. We set the parameters \(\lambda ,\alpha ,\beta ,\delta ,c\) to the best values reported in that work.

Moreover, we test two variants of our RWR-CR algorithm. RWR-CR-S denotes the proposed method using only the sequence importance measure to weight edges, and RWR-CR-F uses only the freshness importance measure.
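The basic RWR shared by the random walk-based baselines can be sketched as a power iteration. This is an illustrative sketch only, not the implementation of any of the compared methods. It assumes the convention described in Sect. 4.5, where c is the probability of moving to a neighbor, so the walker restarts at the source with probability 1 - c. The weighted adjacency dictionary, the tolerance, and the name `random_walk_with_restart` are assumptions introduced here.

```python
def random_walk_with_restart(adj, source, c=0.7, tol=1e-8, max_iter=100):
    """Power iteration for RWR scores with respect to `source`.

    adj[u][v] is the weight of link u -> v; with probability c the walker
    follows a link (proportionally to its weight), otherwise it restarts.
    """
    nodes = list(adj)
    r = {n: 0.0 for n in nodes}
    r[source] = 1.0
    iteration = 0
    for iteration in range(1, max_iter + 1):
        nxt = {n: 0.0 for n in nodes}
        for u in nodes:
            total = sum(adj[u].values())
            if total == 0:
                # Node without outgoing links: send its mass back to the source.
                nxt[source] += c * r[u]
                continue
            for v, w in adj[u].items():
                nxt[v] += c * r[u] * w / total
        nxt[source] += 1.0 - c  # restart mass
        diff = sum(abs(nxt[n] - r[n]) for n in nodes)
        r = nxt
        if diff < tol:
            break
    return r, iteration


# Toy path graph s - a - b - c with uniform weights (hypothetical data).
adj = {
    "s": {"a": 1.0},
    "a": {"s": 1.0, "b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0},
}
scores, iterations = random_walk_with_restart(adj, "s", c=0.7)

# Recommend nodes that are not yet linked to the source, best score first.
candidates = [n for n in adj if n != "s" and n not in adj["s"]]
ranking = sorted(candidates, key=lambda n: scores[n], reverse=True)
```

In this sketch the methods differ only in the graph passed as `adj`: uniform weights on a co-author network correspond to the homogeneous baseline, whereas weighted variants would supply biased link weights.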

4.4 Results

To evaluate the recommendations produced by all the methods, we used the following methodology. First, we apply the proposed method to the first half of the concurrent authors from \(T_{\textit{first}}\) described in Sect. 4.1 and validate its effectiveness using data from \(T_{\textit{second}}\), in order to select desirable parameters. Then, we randomly select 500 nodes from the other half as target authors. Finally, we compare the proposed model with the alternative methods in terms of RBP and RR on the target authors, and report the averaged results for each method.

Tables 2 and 3 report the average RBP and RR scores on the two datasets for collaborator prediction, both for predicting only new collaborators (the values to the left of the square brackets) and for predicting all collaborators (the values in the square brackets). As discussed before, researchers can cooperate with either new authors or repeated authors. It is intuitive that predicting repeated authors is easier than predicting new ones, and the results in the two tables confirm this: each value in the square brackets is larger than the corresponding value outside. Three values of the parameter p are used, ranging from relatively impatient users (\(p=0.5\)) to patient users (\(p=0.8\)) and very patient users (\(p=0.95\)).

Table 2 Experimental results (%) of different methods on the DBLP dataset
Table 3 Experimental results (%) of different methods on the hep-th dataset

In Table 2, we can see clearly that the proposed RWR-CR outperforms both traditional and random walk-based link prediction methods. Among the four traditional methods, RA performs best in most cases, followed by AA with similar values, which is consistent with the observation in Lü and Zhou (2011). Overall, however, the traditional methods perform worse than the random walk-based baselines. This indicates their inability to capture the most important factors in co-authorship formation, since they are generic measures intended to serve many scenarios.

Among the four random walk-based baselines, RWR-Homogeneous and MVCWalker both run on a homogeneous network, with the latter slightly better than the former, whereas RWR-Heterogeneous and RWR-LP run on a heterogeneous graph and perform better than the first two. The reason is that the homogeneous network contains only co-author relations between authors and omits much useful extra information. Heterogeneous networks compensate for this defect with multiple relations and entities that represent richer semantics. Furthermore, RWR-Heterogeneous weights every link in the network uniformly, so it makes no distinction between attribute nodes and author nodes and produces more noisy candidates as the walk goes deeper. In contrast, RWR-LP uses a more complicated edge weighting process to guide the random walker's behavior, so it consistently produces more accurate candidates, as verified by the RBP and RR values.

As expected, our methods gain better performance on the DBLP dataset. We also note that RWR-CR-F gives slightly better results than RWR-CR-S on all metrics. This seems to indicate that the freshness of the latest interaction between a node pair plays a more important role in the edge weighting process than the sequence of the interactions. Meanwhile, combining the two importance measures improves the predictive performance further, analogous to ensemble methods in machine learning, where an ensemble is usually significantly more accurate than a single learner.

The values inside the square brackets, which count old co-authors as well as new partners, are much higher than those outside. The reason may be that repeated co-authors are far fewer than potential new partners in the search space, so predicting new collaborators from a large candidate set is considerably harder. The best RR values for the two cases are 0.159 and 0.591 respectively, which indicates that the first correct new co-author is on average at about rank 6 and the first correct old co-author at about rank 2 in the ordered prediction list. We also observe that the RBP scores are relatively low. The underlying reasons may be as follows. According to the definition of RBP, more relevant items and higher rank positions, especially the first two, give larger RBP values. When \(p\in \{0.5,0.8,0.95\}\), there is roughly a \(\{0.1\%,11\%,60\%\}\) likelihood that a user will enter a second page of 10 results, and it is consequently hard to gain a relatively high score. In fact, researchers usually have few co-authors on each paper (see Table 1), leading to a short candidate list. Meanwhile, it is still challenging to find the relevant nodes swamped in a large, sparse network. In summary, under the t test, the proposed RWR-CR differs significantly from the four traditional methods (\(p\,\hbox{value}<0.01\)) and the first three RWR-based baselines (\(p\,\hbox{value}<0.05\)), and is comparable with RWR-LP.
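The figures quoted above can be checked with a few lines of arithmetic. Under the RBP user model, reaching item 11 (the first item of a second page of 10 results) requires 10 consecutive "continue" decisions, each taken with probability p, so the likelihood is \(p^{10}\). Likewise, the approximate ranks are the reciprocals of the reported RR values. The helper name `second_page_probability` is an illustrative assumption.

```python
def second_page_probability(p, page_size=10):
    """Probability that an RBP user with persistence p views item page_size + 1."""
    return p ** page_size


probs = {p: second_page_probability(p) for p in (0.5, 0.8, 0.95)}
# 0.5 ** 10 is about 0.001, 0.8 ** 10 about 0.107, 0.95 ** 10 about 0.599

# Rank of the first correct item implied by the best RR values on DBLP.
implied_ranks = {rr: 1.0 / rr for rr in (0.159, 0.591)}
# 1 / 0.159 is about 6.3 and 1 / 0.591 is about 1.7
```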

Table 3 shows the scores produced by all methods on the hep-th dataset. Similar conclusions can be drawn: RWR-CR outperforms all the other baselines, RWR-based methods outperform the traditional ones, and RWR-CR performs better than RWR-LP in most cases. The corresponding RBP values in Tables 2 and 3 are nearly the same, indicating a similar user experience, but the best RR in Table 3 (\(0.183\ [0.668]\)) is higher than that in Table 2 (\(0.159\ [0.591]\)). The reason may be that we use the full hep-th data rather than a subset as for DBLP, so the network includes all other authors linked to the target ones and contains richer structural information.

Beyond the above results, one more aspect is worth discussing, namely time cost. After running each model 500 times with the parameters shown in Tables 2 and 3, the average number of iterations and running time (in seconds) are reported in Table 4. Parameter c is critical to the running time of random walk-based methods, since its value determines whether the walker performs a local or a global search. RWR-CR terminates within 6 iterations and in less than 3 s on average per recommendation with \(c=0.7\) for DBLP and \(c=0.6\) for hep-th. Compared with RWR-LP (\(c=0.3\)), our model takes a little longer, but still a tolerable time, to get results, because its larger c makes the walker search the network globally. We also observe that the time costs of the two methods are consistent across the two datasets regardless of graph size. Besides, although RWR-Heterogeneous and MVCWalker need almost the same number of iterations owing to the same RWR-based pattern, their running times differ widely because they run on different networks. From the above results, we can infer that our method is feasible on a large network containing hundreds of thousands of authors. However, for very large graphs such as the complete DBLP network, the performance will drop and a fast random walk algorithm should be used.

Table 4 Consumption of different methods on two datasets

4.5 Parameter selection

There are three parameters in this paper to tune, namely \(\alpha ,\lambda \) and c. Because the best parameters may vary in different datasets, we examine each parameter on the two datasets by tuning from 0 to 1 with step length 0.1, and evaluate their impacts on the prediction performance. Figure 3 presents the curves of the RBP (\(p=0.5\)) for three parameters on two datasets.

Fig. 3 Parameters selection on two datasets. a DBLP. b hep-th

The parameter \(\alpha \) controls the trade-off between the sequence importance measure and the freshness importance measure. Higher values of \(\alpha \) imply that the former measure plays a more important role in determining the edge weight, and vice versa. We varied \(\alpha \) and measured RBP at \(p=0.5\). Experimental results show that the best \(\alpha \) is 0.5 for DBLP and 0.4 for hep-th, and that changing \(\alpha \) has little impact on the RBP.

The parameter \(\lambda \) controls the trade-off between direct and indirect links. Higher values of \(\lambda \) imply that the random walker more often moves forward through the indirect links; otherwise, direct links have more influence on the random walker's behavior. The best \(\lambda \) we find here is 0.5 for DBLP and 0.3 for hep-th, which differs from the 0.6 used by Lee and Adorna (2012), possibly because we search a bibliographic network with a larger number of authors (21,716 in DBLP and 6680 in hep-th) than the number (2505) in that work. Hence, it is not surprising that the best \(\lambda \) is small, so that our model leverages the direct links. Fig. 3 also shows that the performance drops sharply when \(\lambda \) is larger than 0.6.

The parameter c is the damping factor, which determines the probability that the random walker moves to one of its neighbors. In other words, this parameter controls how far the walker will move. In this paper, the best c is set to 0.7 for both datasets because, to find new collaborators with no previous co-authorship, the random walker needs to travel to distant parts of the network rather than stay in the vicinity of the source node.
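The tuning procedure of this section, sweeping one parameter from 0 to 1 in steps of 0.1 and keeping the value with the best score, can be sketched as follows. The `evaluate` callback stands in for running the recommender on the parameter-selection half of the authors and returning, e.g., the average RBP at \(p=0.5\). The callback and the toy objective below are hypothetical placeholders, not results from the experiments.

```python
def parameter_grid(step=0.1):
    """Values 0, step, 2*step, ..., 1, rounded to avoid float drift."""
    n = round(1 / step)
    return [round(i * step, 10) for i in range(n + 1)]


def select_parameter(evaluate, step=0.1):
    """Return (best_value, best_score) over the grid for a scoring callback."""
    best_value, best_score = None, float("-inf")
    for value in parameter_grid(step):
        score = evaluate(value)
        if score > best_score:
            best_value, best_score = value, score
    return best_value, best_score


# Toy stand-in objective that peaks at 0.5 (not real experimental data).
def toy_objective(value):
    return -(value - 0.5) ** 2


best_value, best_score = select_parameter(toy_objective)
```

To tune several parameters one at a time, as described above, the same routine could be called once per parameter while the others are held fixed.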

5 Related work

This section reviews related work on the link prediction problem and on heterogeneous networks.

Collaborator recommendation, generally regarded as a link prediction task and one of the four main problems in academic recommender systems (Wang and Liao 2014), has recently received considerable attention. Most previous studies in this category consider only a single co-author relationship, restricting the problem to a homogeneous network. A seminal work on the link prediction problem is Liben-Nowell and Kleinberg (2003), which summarized unsupervised methods for link prediction in homogeneous networks. These unsupervised methods are mainly topology-based vertex similarity measures, using either local neighborhood information or global topology. Supervised methods, which extract multiple features corresponding to topology or node attributes and learn appropriate coefficients to combine them, were subsequently proposed in Al Hasan et al. (2005), Lichtenwalter et al. (2010), and Yang et al. (2012). Sun et al. (2012) further studied the problem of whether or when a relationship (e.g., co-authorship) will be built, using a generalized linear model-based prediction model. Recently, other aspects of link prediction have been studied, such as robustness measures under noisy environments (Zhang et al. 2016) and edge weighting issues in the network (Zhu and Xia 2016; Sett et al. 2016).

Heterogeneous networks outperform homogeneous networks by providing a more realistic and complete representation of the relationships in real-world bibliographic systems; we list some related work here. Sun et al. (2011b) proposed a meta path-based model called PathPredict, which extracts meta path-based topological features from the network and uses a supervised model to learn the best weights associated with the different topological features, in order to predict co-author relationships. Lee and Adorna (2012) proposed a graph modification process and used four importance measures, ranging from the local and global importance of a node to the frequency and recency of interaction, to achieve a high-quality recommendation list on the altered heterogeneous bibliographic network. That work is the closest to ours, with the following differences: it uses more measures to define edge weights and views the interactions between two authors at a macro level, concentrating on the overall frequency, whereas our work focuses on the interactions rather than other aspects and treats them at a micro level, concentrating on the individual interactions.

Recently, heterogeneous information networks have begun to attract increasing attention. Some recent works (Pujari and Kanawati 2015; Yang et al. 2015) treat the heterogeneous network from a multi-layer network view. Yang et al. (2015) modeled five heterogeneous features from a three-layer heterogeneous network, i.e., the research topic network, the researcher collaboration network, and the institution network, in a unified framework with a supervised SVM-Rank based method for research collaborator recommendation. Besides, Tang et al. (2012) and Dong et al. (2012) studied the problem of transfer link prediction across heterogeneous networks, focusing on leveraging the information acquired from the source network to improve the prediction performance in the target network, based on the common features between the two networks. Link prediction in coupled networks has also been considered (Dong et al. 2015), where links in one network are predicted using the pure structure information of another network and the interactions between the two networks. There are many applications based on heterogeneous networks, including 4G wireless network coverage prediction (Shaikh et al. 2011), drug-target interaction prediction in heterogeneous biological networks (Chen et al. 2012), POI (point of interest) recommendation with a heterogeneous location-based social network (Wang et al. 2015), etc.

6 Conclusion and future work

The link prediction problem is an open issue in data mining and knowledge discovery, and most existing studies, being based on homogeneous networks, cannot model well the heterogeneous relations that exist in practical academic social networks. In this work, we propose a random walk-based algorithm that retrieves relevant partners on a weighted heterogeneous bibliographic network in order to predict potential collaborators of researchers. Experimental results on the DBLP and hep-th bibliographic networks show that, by exploiting a heterogeneous network and RWR, the proposed method achieves better search quality than existing state-of-the-art link prediction methods. In addition, fewer importance measures may provide better performance: RWR-CR uses two, whereas RWR-LP uses four.

For future work, we wish to collect additional information from the node attribute dimension as well as the network structure dimension. Beyond co-authorship, acknowledgement is another form of collaboration that can be leveraged for the collaborator prediction problem. More information, such as author age, education, and supervisor, will also help in studying this task. Therefore, much work remains in mining valuable information from author profiles to perform prediction. Moreover, extensive experiments should be conducted to validate the effectiveness and practicability of the proposed method on a larger bibliographic dataset or in an interdisciplinary environment.