With the explosive growth of academic writing, it is difficult for researchers to find significant papers in their area of interest. In this paper, we propose a pipeline model, named collective topical PageRank, to evaluate the topic-dependent impact of scientific papers. First, we fit a correlation topic model to the textual content of papers to extract scientific topics and their correlations. Then, we present a modified PageRank algorithm, which incorporates the venue, the correlations of the scientific topics, and the publication year of each paper into a random walk to evaluate the paper’s topic-dependent academic impact. Our experiments showed that the model can effectively identify significant papers and venues for each scientific topic, recommend papers for further reading or citing, explore the evolution of scientific topics, and calculate the venues’ dynamic topic-dependent academic impact.

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61602202 and 61603146), the Natural Science Foundation of Jiangsu Province, China (Grant Nos. BK20160427 and BK20160428), the Top-notch Academic Programs Project of Jiangsu Higher Education Institutions, and the Social Key Research and Development Project of Huaian, Jiangsu, China (Grant No. HAS2015020).
Appendix 1: Decrease the preference for older papers
The original PageRank and most of its modifications empirically set the parameter \(\alpha\) to 0.1 or 0.15 to weight the contribution of the bias probability, which leaves them biased in favor of older papers. The TPM uses a linear age-taper strategy to address this limitation, whereas our CTPM takes a different approach. Because \(\alpha\) determines the relative contributions of random choice and transition-based choice, the CTPM adjusts \(\alpha\) dynamically according to each paper's age. Since new papers have had fewer opportunities to be cited than older papers, the CTPM gives newer papers a higher value of \(\alpha\), so that they have a high probability of being chosen at random. In addition, we made the following assumptions about the influence of a paper's age. (1) Researchers prefer up-to-date papers published in the last 3 years, so the age-taper over the last 3 years should be slow. (2) For papers whose ages range from 4 to 10 years, the age-taper is approximately linear. (3) For papers older than 10 years, the age-taper is slow again; for example, a 20-year-old paper has nearly the same timeliness as a 15-year-old paper. We applied a Gaussian decay function so that the age-taper agrees with these assumptions. The dynamic \(\alpha\) is given by Eq. (20), where \(g_{d}\) is the age of paper d and h is the bandwidth parameter that controls the age-taper rate. We experimented with different values of h and found 10 to be an appropriate setting. The age-taper curve for \(h=10\) is illustrated in Fig. 8.
According to Eq. (20), a new paper would be chosen mainly by random choice. In particular, the latest paper with an age of 0 years would be chosen completely at random, since it has no chance to be cited.
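As a minimal sketch of this age-dependent weighting, the following Python snippet assumes a standard Gaussian kernel, \(\alpha_{d} = \exp(-g_{d}^{2}/(2h^{2}))\). This exact form is an assumption made for illustration and may differ from Eq. (20); the function name `age_alpha` is likewise hypothetical. The sketch only preserves the properties stated in the text: \(\alpha\) equals 1 for a paper of age 0 and decreases as the paper gets older.

```python
import math


def age_alpha(age_years: float, h: float = 10.0) -> float:
    """Age-dependent alpha under an ASSUMED Gaussian form: exp(-age^2 / (2 * h^2)).

    age_years: age g_d of the paper in years (non-negative).
    h: bandwidth parameter controlling how fast alpha decays with age.
    """
    if age_years < 0:
        raise ValueError("age_years must be non-negative")
    return math.exp(-(age_years ** 2) / (2.0 * h ** 2))


if __name__ == "__main__":
    # Print alpha for a few ages to inspect the shape of the taper.
    for age in (0, 3, 10, 15, 20):
        print(f"age={age:2d}  alpha={age_alpha(age):.3f}")
```

Under this assumed form, a larger h flattens the curve, so older papers keep a larger share of random choice; a smaller h makes the taper steeper.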
Appendix 2: A simple example to illustrate our algorithm
To explain our algorithm more clearly, we include a small example. Figure 9 shows a citation network with 4 papers, \(d_{1}\), \(d_{2}\), \(d_{3}\), and \(d_{4}\). Paper \(d_{2}\) is cited by both \(d_{1}\) and \(d_{3}\), and \(d_{3}\) cites both \(d_{4}\) and \(d_{2}\). Here, we describe the calculation of \({\mathrm{TPR}}^{1}(d_{2}|k)\) in detail. The first step is to calculate the Gaussian decay factor of \(d_{2}\); we obtain \(\alpha _{2}\) directly from Eq. 20. Then we need to calculate \({\mathrm{TPR}}^{0}(d_{2}|k)\), which is given by \(\frac{r_{d_{2},k}}{r_{d_{1},k}+r_{d_{2},k}+r_{d_{3},k}+r_{d_{4},k}}\). After we initialize \({\mathrm{TPR}}\) for all the papers, we calculate the average topic-dependent score of the papers in venues \(v_{1}\) and \(v_{2}\), respectively (as the averages show, \(d_{1}\) and \(d_{2}\) belong to \(v_{1}\), and \(d_{3}\) and \(d_{4}\) belong to \(v_{2}\)). These averages are denoted as \({{\mathrm{avg}}}_{v_{1}} = \dfrac{{\mathrm{TPR}}^{0}(d_{1}|k)+{\mathrm{TPR}}^{0}(d_{2}|k)}{2}\) and \({{\mathrm{avg}}}_{v_{2}} = \dfrac{{\mathrm{TPR}}^{0}(d_{3}|k)+{\mathrm{TPR}}^{0}(d_{4}|k)}{2}\). The initial topic-dependent scores of \(v_{1}\) and \(v_{2}\) are then \(V^{0}(v_{1}|k)=\dfrac{{{\mathrm{avg}}}_{v_{1}}}{{{\mathrm{avg}}}_{v_{1}}+{{\mathrm{avg}}}_{v_{2}}}\) and \(V^{0}(v_{2}|k)=\dfrac{{{\mathrm{avg}}}_{v_{2}}}{{{\mathrm{avg}}}_{v_{1}}+{{\mathrm{avg}}}_{v_{2}}}\). This completes the initialization, and we enter the first iteration. For \(d_{2}\), the first quantity to compute is the bias probability \(B^{1}(d_{2}|k)\).
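The initialization stage of this example can be sketched in Python as follows. The values of \(r_{d,k}\) are made up for illustration, the venue membership is the one implied by the two averages in the text, and the function names `init_paper_scores` and `init_venue_scores` are hypothetical.

```python
# Hypothetical values of r_{d,k} for one topic k (numbers are made up).
r = {"d1": 0.2, "d2": 0.4, "d3": 0.3, "d4": 0.1}

# Venue membership implied by the averages in the example.
venue_papers = {"v1": ["d1", "d2"], "v2": ["d3", "d4"]}


def init_paper_scores(r):
    """TPR^0(d|k): each paper's r_{d,k} divided by the sum over all papers."""
    total = sum(r.values())
    return {d: value / total for d, value in r.items()}


def init_venue_scores(tpr0, venue_papers):
    """V^0(v|k): per-venue average of TPR^0, normalized over all venues."""
    avg = {
        v: sum(tpr0[d] for d in papers) / len(papers)
        for v, papers in venue_papers.items()
    }
    norm = sum(avg.values())
    return {v: a / norm for v, a in avg.items()}


tpr0 = init_paper_scores(r)
v0 = init_venue_scores(tpr0, venue_papers)

if __name__ == "__main__":
    print("TPR^0:", tpr0)
    print("V^0:  ", v0)
```

Both dictionaries sum to 1, so the initial paper scores and the initial venue scores each form a probability distribution for topic k.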
In our algorithm, the most complicated step is the calculation of the transition probabilities. Two papers, \(d_{1}\) and \(d_{3}\), cite \(d_{2}\), so the transition probabilities \(T(d_{2}|d_{1},k)\) and \(T(d_{2}|d_{3},k)\) are greater than 0 and the other transition probabilities into \(d_{2}\) are 0. The calculation of \(T^{1}(d_{2}|d_{3},k)\) takes 3 steps. The first step is to obtain \(T^{'}(d_{2}|d_{3},k)\), which is calculated using Eq. 8 with \(L_{d_{3}} = \{d_{2},d_{4}\}\).
The second step is to obtain \(T^{''}(d_{2}|d_{3},k)\), which is calculated using Eq. 9 with \(C_{d_{2}}=\{d_{1},d_{3}\}\).
In the last step, we calculate \(T^{1}(d_{2}|d_{3},k)\) according to Eq. 10. This step also needs \(T^{''}(d_{4}|d_{3},k)\), whose calculation is similar to that of \(T^{''}(d_{2}|d_{3},k)\). The calculation of the other transition probability, \(T^{1}(d_{2}|d_{1},k)\), follows the same process as that of \(T^{1}(d_{2}|d_{3},k)\). Once we have calculated \(\alpha _{2}\), \(B^{1}(d_{2}|k)\), \(T^{1}(d_{2}|d_{1},k)\), and \(T^{1}(d_{2}|d_{3},k)\), \({\mathrm{TPR}}^{1}(d_{2}|k)\) can be obtained with Eq. 4.
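To make the final combination step concrete, the sketch below builds the sets \(L_{d}\) and \(C_{d}\) from the citations stated in this example and then applies one update for \(d_{2}\). It assumes that Eq. 4 has the usual PageRank-style form \({\mathrm{TPR}}^{1}(d|k) = \alpha _{d} B^{1}(d|k) + (1-\alpha _{d}) \sum _{c \in C_{d}} T^{1}(d|c,k)\,{\mathrm{TPR}}^{0}(c|k)\); this form is an assumption for illustration and may differ from the actual equation. All numeric inputs are hypothetical, and the transition probabilities are supplied directly rather than derived from Eqs. 8 to 10.

```python
from collections import defaultdict

# Citations stated in the example: (citing paper, cited paper).
citations = [("d1", "d2"), ("d3", "d2"), ("d3", "d4")]

references = defaultdict(set)  # L_d: papers that d cites
citers = defaultdict(set)      # C_d: papers that cite d
for citing, cited in citations:
    references[citing].add(cited)
    citers[cited].add(citing)


def update_score(d, alpha, bias, transition, prev_scores):
    """ASSUMED update: alpha * B(d) + (1 - alpha) * sum over citers c of T(d|c) * TPR_prev(c)."""
    walk = sum(transition[(d, c)] * prev_scores[c] for c in citers[d])
    return alpha * bias + (1.0 - alpha) * walk


# Hypothetical inputs for paper d2 and topic k (all numbers are made up).
alpha_2 = 0.5
bias_2 = 0.4
transition = {("d2", "d1"): 1.0, ("d2", "d3"): 0.5}  # T^1(d2|d1,k), T^1(d2|d3,k)
tpr0 = {"d1": 0.2, "d2": 0.4, "d3": 0.3, "d4": 0.1}

tpr1_d2 = update_score("d2", alpha_2, bias_2, transition, tpr0)

if __name__ == "__main__":
    print("L_d3 =", sorted(references["d3"]))
    print("C_d2 =", sorted(citers["d2"]))
    print("TPR^1(d2|k) =", round(tpr1_d2, 4))
```

Only the papers in \(C_{d_{2}}\) contribute to the walk term, which matches the observation that the transition probabilities into \(d_{2}\) from non-citing papers are 0.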
