1 Introduction

Paper-reviewer recommendation refers to the automated process for selecting candidates to perform peer review. This enables journal and conference committees to match papers quickly and accurately to reviewers (McGlinchey et al. 2019). Manually conducting this pairing is labor intensive. Furthermore, it is difficult for a nonprofessional chair to assign suitable matches. Many reviewer assignment systems exist to automate this process (e.g., the Toronto paper matching system (Charlin and Zemel 2013), SubSift (Flach et al. 2010), the Microsoft conference management toolkit, the global review assignment processing engine (GRAPE) (Di Mauro et al. 2005), Erie (Li and Hou 2016), advanced reviewer assignment system (Kou et al. 2015b), and decision support system (Hoang et al. 2019)). These systems are completely automated and have been used for many real conferences (e.g., NIPS, ICML, CVPR, ICCV, and ACML).

The problem of paper-reviewer recommendation is known as the reviewer assignment problem (RAP) (Tayal et al. 2014). Dumais and Nielsen (1992) were the first to address this problem. They treated the RAP as an information retrieval issue and used the latent semantic indexing (LSI) model to establish the relationship between the reviewer and the paper. With the development of topic models, Mimno and McCallum (2007) used the more advanced latent Dirichlet allocation (LDA) model and author-topic (AT) model and proposed an author-persona-topic (APT) model to better represent the topics covered by a reviewer. These methods are based on semantic information. To further mine the features of the reviewers and papers, some researchers have used word-based information. Peng et al. (2017) used the term frequency-inverse document frequency (TF-IDF) to mine the statistical characteristics of reviewers and papers. They combined this approach with the topic model to propose the time-aware and topic-based (TATB) model. However, these methods neglect two constraints of the RAP: the incompleteness of the reviewer data and the interference from nonmanuscript-related papers in the reviewer data. We present these two challenges and their corresponding solutions below.

1.1 Incompleteness of the reviewer data

It is not practical to obtain accurate and up-to-date full-text papers for all reviewers because data collection and processing are difficult, and the data may even be multilingual. We usually take only the titles and abstracts of the reviewers’ papers as reviewer data. Such incomplete reviewer data cannot accurately and quantitatively reflect the fields (topics) of a reviewer’s expertise. To resolve this problem, we use a ranking-based approach that turns the topic distribution into an ordered sequence so that the quantitative probability values of the topics can be ignored, thereby reducing the influence of inaccurate topic probabilities and reducing overfitting to incomplete reviewer data. We first obtain the reviewer and target manuscript topics using the topic model. For the ranking-based approach, we use the normalized discounted cumulative gain (NDCG) as a similarity metric to compute the semantic similarity between the reviewer and the manuscript.

1.2 The interference from nonmanuscript-related papers

When assigning reviewers, we focus on the authors (reviewers) of papers that are highly similar to the manuscript, regardless of whether these authors have also published many papers that are unrelated to the manuscript. In contrast, a full-text similarity calculation that treats all of a reviewer’s papers as one document is affected not only by the paragraphs that are highly similar to the query but also by the many nonsimilar paragraphs. Therefore, when calculating the similarity between the reviewer’s papers as a whole and the manuscript, a large number of irrelevant papers may excessively reduce the similarity between the reviewer and the manuscript. To resolve this problem, we calculate the similarity between each of the reviewer’s papers and the manuscript, thus highlighting the importance of papers that are highly similar to the manuscript. We also measure the impact of low-similarity papers by calculating the similarity between the manuscript and all of the reviewer’s papers, because it is difficult to weigh the impact of low-similarity papers directly (e.g., how many low-similarity papers are equivalent to one high-similarity paper?). Finally, we combine these two factors in an iterative way.

Our contributions in this paper are summarized as follows.

  1. We propose a word and semantic-based iterative model (WSIM) that, for the first time, accounts for the constraints of the reviewer assignment problem by improving the similarity metrics between reviewers and manuscripts.

  2. We use the NDCG as the similarity metric to compute the semantic similarity of the topics. This approach ignores the probability value (quantitative exact value) of each topic and considers only the ranking (qualitative relevance), thus reducing overfitting to incomplete reviewer data.

  3. We use an iterative model to reduce the interference in the assignment from nonmanuscript-related papers in the reviewer data. This approach considers the similarity between the manuscript and each of the reviewer’s papers, thus reducing the importance of nonmanuscript-related papers in the reviewer data.

  4. We perform experiments on two datasets with six metrics and seven comparison algorithms to show that our model effectively overcomes these challenges.

This paper is organized as follows. Section 2 describes the related research. Section 3 provides the problem formulation, explains our proposed model, describes the model learning algorithm, and introduces the applications of the model. Section 4 describes the experimental setup, the comparison methods, and the performance results. Finally, Sect. 5 concludes this paper.

2 Related work

The authors of Dumais and Nielsen (1992) were the first to discuss automated reviewer recommendations, acknowledging the importance of this task for journal editors as well as the drawbacks of manual assignment. This problem has many names including the conference paper assignment problem (CPAP) (Goldsmith and Sloan 2007), RAP (Wang et al. 2008), paper-reviewer assignment (PRA) (Long et al. 2013), and reviewer assignment (RA) (Wang et al. 2013). Dumais and Nielsen (1992) divided the problem into two processes: selecting the most suitable reviewers for a manuscript and determining the most suitable reviewers for many manuscripts in the case of restrictions. The former is termed retrieval-based RAP (RRAP) (Kou et al. 2015a), while the latter can be termed constrained multiaspect committee review assignment (CMACRA) (Karimzadehgan and Zhai 2009), assignment-based RAP (ARAP) (Kou et al. 2015a) or the multiagent resource allocation problem (MARA) (Lian 2018).

We use the terms RRAP and ARAP as in Kou et al. (2015a). The ARAP focuses more on optimization issues (Yeşilçimen and Yıldırım 2019) (e.g., how many relevant manuscripts should be assigned to each reviewer to achieve global optimization?). This paper focuses on the RRAP, for which related methods can be divided into three categories: those based on semantic information, those based on word information, and those based on other information. Semantic information captures relationships between words, typically described by topics. Word information defines the relationship between the reviewer and the manuscript through statistical word frequencies and related measures. In addition to semantic and word information, nontextual information can be used to calculate similarity, including classification information pertaining to the paper (Zhang et al. 2020a; Liu et al. 2016) and information provided by the reviewers (Rigaux 2004; Di Mauro et al. 2005). Rule-based (Di Mauro et al. 2005), collaborative filtering (Rigaux 2004) or network-based (Fair and accurate reviewer assignment in peer review 2019; Xu et al. 2019; Anaya et al. 2019) methods are often used for this type of information. We focus on methods based on semantics, words, and a combination of these types of information.

2.1 Semantic-based approach

Dumais and Nielsen (1992) transformed the RRAP into a retrieval problem, using latent semantic indexing (LSI) to extract semantic information and cosine similarity to calculate the similarity between the reviewer and the manuscript. LSI is a common method for extracting topic information, and Ferilli et al. (2006) and Li and Hou (2016) also used this method. pLSA (Karimzadehgan and Zhai 2012, 2009) and LDA (including variants) (Charlin and Zemel 2013; Misale and Vanwari 2017; Kim and Lee 2018) are improved methods for extracting topic information. Karimzadehgan et al. (2008) first used the pLSA model to obtain topics and calculate the similarity between reviewers and manuscripts. Mimno and McCallum (2007) first used the LDA model to extract semantic information and proposed the APT model to improve LDA with respect to describing textual information from reviewers and manuscripts. Li and Watanabe (2013) combined the APT model with a time factor to measure the degree of expertise of reviewers. Based on this, Peng et al. (2017) employed TF-IDF to consider word information. Kou et al. (2015a) used a topic-weighted coverage calculation based on the topic features of LDA and proposed a branch-and-bound algorithm (BBA) to find reviewers quickly. In addition to the topic model, Ogunleye et al. (2017) used word2vec to calculate similarity. Zhao et al. (2018) transformed the RRAP into a classification problem, used the word mover’s distance (WMD) method to calculate similarity, and then used the constructive covering algorithm (CCA) to simultaneously classify reviewers and manuscripts. In Zhang et al. (2020b), the RRAP was also cast as a multilabel classification task in which the reviewers were assigned according to multiple predicted labels.

2.2 Word-based approach

The most commonly used word-based methods are keyword matching (Sidiropoulos and Tsakonas 2015; Protasiewicz et al. 2016; Shon et al. 2017; Dung et al. 2017), TF-IDF (Hettich and Pazzani 2006; Flach et al. 2010; Peng et al. 2017), and the language model (LM) (Mimno and McCallum 2007; Tang et al. 2010; Charlin et al. 2012). Tang and Zhang (2008) calculated the similarity between reviewers and manuscripts by constructing a keyword network and using cosine similarity for keyword matching. Protasiewicz (2014) added publication time information to calculate keyword weights. Dung et al. (2017) improved the keyword matching results by improving the Knuth-Morris-Pratt (KMP) algorithm. Yarowsky and Florian (1999) first used TF-IDF and cosine similarity to calculate the similarity between the reviewer and manuscript. Basu et al. (2001) used a TF-IDF-based information integration system (WHIRL) combined with collaborative filtering. They obtained the recommendation source matrix using the scores retrieved by WHIRL. Biswas and Humayun (2007) mapped keywords to topics based on TF-IDF, which combines ontology-driven inferences. Protasiewicz et al. (2016) directly retrieved relevant reviewers using a full-text index based on TF-IDF. Charlin and Zemel (2013) used LM as the similarity calculation method for the Toronto paper matching system.

2.3 Approach combining semantic and word information

Few existing methods simultaneously consider the semantic and word information of reviewers and manuscripts to capture both types of similarity between a reviewer and a manuscript. Tang et al. (2010, 2012) were the first to combine the language model and LDA to calculate the similarity between reviewers and manuscripts. Peng et al. (2017) used term frequency-inverse document frequency (TF-IDF) to mine the word information of reviewers and papers. They combined this approach with the topic model to propose the time-aware and topic-based (TATB) model.

These semantic-based or word-based approaches treat the reviewer assignment problem as an information retrieval problem but do not take into account the constraints of the reviewer assignment problem. Hence, we propose a WSIM based on LDA and LM to account for the constraints of the reviewer assignment problem by improving the similarity calculations between reviewers and manuscripts.

3 Proposed model

In this section, we first formulate the reviewer assignment problem and notation used in this paper. Then, we describe the word and semantic information extraction. Finally, we detail the ranking-based approach and iterative model for considering the constraints of the reviewer assignment problem.

3.1 Problem definition and notation

First, we define our terms in a formal way. We define a set of reviewer papers \(\mathbf {D}=\{d_1,d_2,...,d_{|\mathbf {D}|}\}\) and a set of manuscripts \(\mathbf {P}=\{p_1,p_2,...,p_{|\mathbf {P}|}\}\), where \(d_i\) and \(p_i\) denote the text information (e.g., title, abstract, etc.) of the reviewer’s paper and manuscript, respectively. We define a set of reviewers \(\mathbf {R}=\{r_1,r_2,...,r_{|\mathbf {R}|}\}\), where \(r_i\) denotes the text information (composed of \(d_j\in \mathbf {D}\)) of the reviewer.

Then, we define our problem in a formal way. Given three sets \(\mathbf {D},\mathbf {P},\mathbf {R}\) and topN (the number of reviewers required for each manuscript), our goal is to obtain the most suitable topN reviewers (a subset of \(\mathbf {R}\)) for each manuscript \(p_i\in \mathbf {P}\).
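To make this formulation concrete, the following minimal Python sketch represents the three sets and the topN requirement as plain data structures. The names RRAPInstance, reviewer_papers, and top_n are illustrative assumptions for exposition, not notation or code from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

# d_i and p_i are plain text strings (e.g., title plus abstract).
ReviewerPapers = Dict[str, List[str]]   # reviewer id -> list of paper texts (D grouped by reviewer)
Manuscripts = Dict[str, str]            # manuscript id -> text (the set P)


@dataclass
class RRAPInstance:
    """Retrieval-based RAP: find the topN most suitable reviewers for each manuscript."""
    reviewer_papers: ReviewerPapers     # defines D and R (r_i is the union of its papers' text)
    manuscripts: Manuscripts            # defines P
    top_n: int = 20                     # number of reviewers required per manuscript
```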

The definition of the retrieval-based RAP (RRAP) is given above. We solve this problem based on two characteristics of reviewer data. In the next subsection, we will begin to describe the proposed word and semantic-based iterative model (WSIM) for the reviewer assignment problem. Table 1 lists the notation used in the proposed model.

Table 1 Notations

3.2 Feature extraction

To calculate the similarity between the reviewer and the manuscript, we need to obtain the semantic and word features of both. The semantic features capture word cooccurrence information at the topic level, and the word features capture word cooccurrence information at the document level. These two levels of information make the similarity calculation more comprehensive.

3.2.1 Semantic features

We use the topic model (LDA) to extract the semantic information from the reviewer publications and the manuscript text. LDA assumes that a text contains multiple topics under the unigram hypothesis and obtains the topics of a text by Gibbs sampling. We apply LDA to each reviewer’s textual information to obtain the reviewer-topic distribution \(\theta _{mat}\):

$$\begin{aligned} \theta _{mat}\;= & {} \{\theta _1,...,\theta _{|{\mathbf {R}}|}\} \nonumber \\ \theta _m\;= & {} \{\theta _{m,1},...,\theta _{m,K}\},\quad 1\leqslant m\leqslant |{\mathbf {R}}| \nonumber \\ \theta _{m,i}\;= & {} \frac{n_{m,i}+\alpha }{\sum _{j=1}^{K}{(n_{m,j}+\alpha )}},\quad 1\leqslant i\leqslant K \end{aligned}$$
(1)

where K denotes the number of topics, \(n_{m,i}\) denotes the occurrence of the ith topic within the topics covered by reviewer \(r_m\), as obtained by Gibbs sampling, and \(\alpha \) denotes the hyperparameter of the LDA model. After obtaining the reviewer-topic distribution, we can predict the manuscript-topic distribution \(\theta _{\mathbf {P}}=\{\theta _{p_1},...,\theta _{p_{|\mathbf {P}|}}\}\), where \(\theta _{p_m}\) denotes the multinomial topic distribution of manuscript \(p_m\). The topic-word distribution \(\varphi _{mat}=\{\varphi _1,...,\varphi _{K}\}\) is defined analogously to \(\theta _{mat}\), with m, K, and \(\alpha \) replaced by k, V, and \(\beta \).

For consistency, the topics of each reviewer are directly represented by the topics of the reviewer’s papers. This method requires a separate representation of the reviewer’s papers. According to the reviewer-topic model, the paper-topic distribution \(\rho _{mat}\) is expressed as Eq. (2):

$$\begin{aligned} \begin{aligned} \rho _{mat}\;&=\{\rho _1,...,\rho _{|\mathbf {D}|}\} \\ \rho _m\;&=\{\rho _{m,1},...,\rho _{m,K}\},\quad 1\leqslant m\leqslant |\mathbf {D}| \\ \rho _{m,i}\;&=\frac{n_{m,i}+\alpha }{\sum _{j=1}^{K}{(n_{m,j}+\alpha )}},\quad 1\leqslant i\leqslant K \\ \end{aligned} \end{aligned}$$
(2)

where \(n_{m,i}\) denotes the occurrence of the ith topic in the mth reviewer paper \(d_m\).

Thus, we represent the semantic features of the textual information using the topic distribution \(\theta _{mat},\theta _{\mathbf {P}},\rho _{mat}\).
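As a concrete illustration (not the authors’ implementation), the following sketch computes the smoothed topic distributions of Eqs. (1) and (2) from Gibbs-sampling topic counts. The count matrix below is a toy assumption rather than the output of a real LDA run.

```python
import numpy as np


def topic_distribution(topic_counts: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Eq. (1)/(2): theta_{m,i} = (n_{m,i} + alpha) / sum_j (n_{m,j} + alpha).

    topic_counts has shape (num_documents, K); entry (m, i) is the number of
    tokens of document m assigned to topic i by Gibbs sampling.
    """
    smoothed = topic_counts + alpha
    return smoothed / smoothed.sum(axis=1, keepdims=True)


# Toy example: 2 reviewers (or papers) and K = 4 topics.
counts = np.array([[30, 5, 0, 1],
                   [2, 2, 20, 10]], dtype=float)
theta_mat = topic_distribution(counts, alpha=0.5)   # each row sums to 1
print(theta_mat)
```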

3.2.2 Word features

We use the language model to represent the word information. In the language model, the relevance between a query word w and a paper \(d_i\) (or a reviewer \(r_i\)) is expressed as the generation probability \(P_{LM}(w|d_i)\) (or \(P_{LM}(w|r_i)\)), as follows:

$$\begin{aligned} \begin{aligned} P_{LM}(w|d_i)=\frac{N_{d_i}}{N_{d_i}+\lambda }\cdot \frac{tf(w,d_i)}{N_{d_i}}+(1-\frac{N_{d_i}}{N_{d_i}+\lambda })\cdot \frac{tf(w,\mathbf {D})}{N_\mathbf {D}} \\ P_{LM}(w|r_i)=\frac{N_{r_i}}{N_{r_i}+\lambda }\cdot \frac{tf(w,r_i)}{N_{r_i}}+(1-\frac{N_{r_i}}{N_{r_i}+\lambda })\cdot \frac{tf(w,\mathbf {R})}{N_\mathbf {R}} \end{aligned} \end{aligned}$$
(3)

where \(N_{d_i}\) denotes the length of paper \(d_i\), \(\lambda \) denotes the average length across all of the papers, \(tf(w,d_i)\) denotes the number of times word w appears in paper \(d_i\), \(tf(w,\mathbf {D})\) denotes the number of times word w appears in all papers \(\mathbf {D}\), and \(N_\mathbf {D}\) denotes the total length of all of the papers. The parameters in \(P_{LM}(w|r_i)\) are analogous.

The query terms w are drawn from a manuscript \(p_k\). To effectively capture the importance of certain low-frequency words and reduce the weight of insignificant high-frequency words, we extract the word collection \(\mathbf {p}_k\) without considering repeated words in the manuscript \(p_k\). Because different manuscripts contain different numbers of words, the language model can produce results that differ by orders of magnitude for manuscripts of different lengths. To solve this problem, we sort the words of manuscript \(p_k\) and keep the collection of the first t words \(\mathbf {W}_{p_k}\), so that all manuscripts are reduced to the same length. This process is described in Eq. (4):

$$\begin{aligned} \begin{aligned} \mathop {\arg \max }_{\mathbf {W}_{p_k}}\quad&P_{LM}(w_t|d_i)\\ where\quad&\mathbf {W}_{p_k}=\{w_1,w_2,...,w_t\}\subseteq \mathbf {p}_k,\; d_i\in \mathbf {D}\\ s.t.\quad&\forall w_j\in \mathbf {W}_{p_k}\\ \qquad&\Rightarrow P_{LM}(w_j|d_i)\geqslant P_{LM}(w_{j+1}|d_i) \end{aligned} \end{aligned}$$
(4)

Finally, we obtain the word-based similarity \(LM(d_i,p_k)\) between manuscript \(p_k\) and paper \(d_i\):

$$\begin{aligned} LM(d_i,p_k)=\prod _{w_j\in \mathbf {W}_{p_k}}{P_{LM}(w_j|d_i)} \end{aligned}$$
(5)

Thus, we represent the word features of the textual information using an improved language model.
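The sketch below illustrates Eqs. (3)–(5) under simplifying assumptions: a length-smoothed unigram language model, selection of the t highest-scoring distinct manuscript words (Eq. (4)), and their product as the word-based similarity (Eq. (5)). Whitespace tokenization and the toy corpus are assumptions for exposition only.

```python
from collections import Counter


def lm_score(word, doc_tokens, corpus_counts, corpus_len, avg_len):
    """Eq. (3): smoothed probability of generating `word` from a document."""
    n_d = len(doc_tokens)
    mix = n_d / (n_d + avg_len)                 # document weight; avg_len plays the role of lambda
    p_doc = doc_tokens.count(word) / n_d if n_d else 0.0
    p_corpus = corpus_counts[word] / corpus_len  # background (corpus-level) probability
    return mix * p_doc + (1.0 - mix) * p_corpus


def lm_similarity(manuscript_tokens, doc_tokens, corpus_counts, corpus_len, avg_len, t=80):
    """Eqs. (4)-(5): product of the scores of the top-t distinct manuscript words."""
    distinct = set(manuscript_tokens)            # drop repeated words in the manuscript
    scores = sorted((lm_score(w, doc_tokens, corpus_counts, corpus_len, avg_len)
                     for w in distinct), reverse=True)
    sim = 1.0
    for s in scores[:t]:
        sim *= s
    return sim


# Toy corpus of reviewer papers D.
docs = ["topic model reviewer assignment".split(),
        "graph neural network training".split()]
corpus_counts = Counter(w for d in docs for w in d)
corpus_len = sum(corpus_counts.values())
avg_len = corpus_len / len(docs)
manuscript = "reviewer assignment with topic model".split()
print(lm_similarity(manuscript, docs[0], corpus_counts, corpus_len, avg_len, t=3))
```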

3.3 Ranking-based approach and iterative model

After obtaining the features of the reviewers and the manuscript, we detail the ranking-based approach and iterative model for considering the constraints of the reviewer assignment problem.

3.3.1 Ranking-based approach

To reduce the influence of inaccurate topic probability values, we use the NDCG as the similarity metric, turning the topic distribution into an ordered sequence so that the quantitative probability values can be ignored. This approach considers only the ranking (qualitative relevance) of the topics rather than their exact probability values, thus reducing overfitting to incomplete reviewer data.

The NDCG similarity between reviewer r and manuscript p, computed from \(\theta _{mat}\) and \(\theta _{\mathbf {P}}\), is expressed as \({\text {NDCG}}_K(r,p)\). \({\text {NDCG}}_K(r,p)\) must be normalized to obtain the topic similarity. The topic NDCG (tNDCG) similarity is expressed as Eq. (6); \({\text {tNDCG}}_K(d,p)\) is analogous.

$$\begin{aligned} \begin{aligned} {\text {tNDCG}}_K(r,p)&=\frac{{\text {NDCG}}_K(r,p)-\frac{{\text {bDCG}}_K}{{\text {iDCG}}_K}}{1-\frac{{\text {bDCG}}_K}{{\text {iDCG}}_K}}\\ {\text {NDCG}}_K(r,p)&=\frac{{\text {DCG}}_K(r,p)}{{\text {iDCG}}_K} \\ \end{aligned} \end{aligned}$$
(6)

where \({\text {iDCG}}_K\), \({\text {bDCG}}_K\), and \({\text {DCG}}_K(r,p)\) are further defined as:

$$\begin{aligned} \begin{aligned} {\text {iDCG}}_K&=\sum _{i=1}^{K}{\frac{y(i)}{\log _2(i+1)}} \\ {\text {bDCG}}_K&=\sum _{i=1}^{K}{\frac{y(K-i+1)}{\log _2(i+1)}}\\ {\text {DCG}}_K(r,p)&=\sum _{i=1}^{K}{\frac{y\big (rank[x(\theta _r),i,x(\theta _p)]\big )}{\log _2(i+1)}}\\ where\quad&x(\theta _r)=\{k_1,...,k_K\},\theta _r\in \theta _{mat},\theta _p\in \theta _{\mathbf {P}} \\ s.t.\quad&\forall i\in [1,K-1]\Rightarrow \theta _{r,k_i}\geqslant \theta _{r,k_{i+1}},\;\theta _{r,k_i}\in \theta _r \end{aligned} \end{aligned}$$
(7)

where \(x(\theta _r)\) denotes the topics ranked in descending order of probability. \(rank[x(\theta _r),i,x(\theta _p)]\) denotes the rank in \(x(\theta _r)\) of the topic \(k_i\) appearing at position i of \(x(\theta _p)\). The function y denotes the rank value function, with \(y(i)=i^{-\frac{1}{2}}\). \({\text {bDCG}}\) (bad DCG) denotes the lower bound of \({\text {DCG}}_K(r,p)\) and serves to normalize \({\text {NDCG}}_K(r,p)\).
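The following sketch computes the tNDCG similarity of Eqs. (6)–(7) between two topic distributions. It assumes the rank value function \(y(i)=i^{-1/2}\) given above and, following the normalization role of bDCG, uses the fully reversed ranking as the lower bound; the example distributions are toy values.

```python
import numpy as np


def t_ndcg(theta_r: np.ndarray, theta_p: np.ndarray) -> float:
    """Eqs. (6)-(7): ranking-based topic similarity, normalized to [0, 1]."""
    K = len(theta_r)
    y = lambda i: i ** -0.5                           # rank value function y(i) = i^(-1/2)
    discounts = 1.0 / np.log2(np.arange(1, K + 1) + 1)

    order_r = np.argsort(-theta_r)                    # x(theta_r): topics by descending probability
    order_p = np.argsort(-theta_p)                    # x(theta_p)
    rank_in_r = np.empty(K, dtype=int)                # topic id -> 1-based rank in x(theta_r)
    rank_in_r[order_r] = np.arange(1, K + 1)

    dcg = sum(y(rank_in_r[order_p[i]]) * discounts[i] for i in range(K))
    idcg = sum(y(i + 1) * discounts[i] for i in range(K))   # best case: identical rankings
    bdcg = sum(y(K - i) * discounts[i] for i in range(K))   # worst case: reversed ranking

    baseline = bdcg / idcg
    return (dcg / idcg - baseline) / (1.0 - baseline)


theta_reviewer = np.array([0.55, 0.25, 0.15, 0.05])
theta_manuscript = np.array([0.50, 0.30, 0.10, 0.10])
print(t_ndcg(theta_reviewer, theta_manuscript))       # close to 1 for similar rankings
```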

3.3.2 Iterative model

To reduce the interference in the assignment from nonmanuscript-related papers in the reviewer data, we calculate the similarity between each reviewer’s paper and the manuscript, thus highlighting the importance of papers that are highly similar to the manuscript. Then, we measure the impact of low-similarity papers by calculating the similarity of the manuscript and all papers of the reviewer. This is because it is difficult to directly weigh the impact of low-similarity papers, e.g., how many low-similarity papers can be equivalent to one high-similarity paper? Finally, we combine these two factors in an iterative way.

When we combine these two factors using an iterative model, the similarity of one reviewer to the manuscript is influenced by the similarity of the manuscript and each paper for that reviewer, and the similarity of one reviewer’s paper to the manuscript is influenced by the similarity of each author (reviewer) to the manuscript. We can describe this with the following formula, Eq. (8):

$$\begin{aligned} \begin{aligned} {\left\{ \begin{array}{ll} \gamma ^0[r]={\text {tNDCG}}_K(r,p)\cdot LM(r,p)\\ \gamma ^k[r]=(1-\xi _d)\gamma ^{k-1}[r]+\xi _d\cdot \pi ^k[f_{rd}(r)] \end{array}\right. } \\ {\left\{ \begin{array}{ll} \gamma ^0[d]={\text {tNDCG}}_K(d,p)\cdot LM(d,p)\\ \gamma ^k[d]=(1-\xi _r)\gamma ^{k-1}[d]+\xi _r\cdot \pi ^k[f_{dr}(d)] \end{array}\right. } \end{aligned} \end{aligned}$$
(8)

where \(\gamma ^k[r]\) denotes the relevance of reviewer r to the manuscript at the kth iteration and \(\gamma ^k[d]\) denotes the relevance of the reviewer’s paper d to the manuscript at the kth iteration. Further, \(\xi _d\) denotes the iterative weight of the reviewer’s paper, \(\xi _r\) denotes the iterative weight of the reviewer, \(f_{rd}(r)\) denotes all of the papers of reviewer r, and \(f_{dr}(d)\) denotes all of the reviewers of the reviewer’s paper d.

In the above formula, \(\pi ^k[f_{rd}(r)]\) is essential. It denotes the relevance of reviewer r’s papers to the manuscript. Because nonmanuscript-related papers can overshadow manuscript-related papers, we highlight the importance of papers that are highly similar to the manuscript through the function \(\pi \). By ranking the relevance of reviewer r’s collection of papers \(f_{rd}(r)\), different weights can be assigned to the reviewer’s papers so that papers of different levels of importance are distinguished. For reviewer r, we determine the ranking \(\mu ^k_r\) of the relevance between each of the reviewer’s papers and the target manuscript in the kth iteration:

$$\begin{aligned} \begin{aligned}&\mu ^k_r=\{\gamma ^{k-1}[d_1],...,\gamma ^{k-1}[d_h]\} \\ where\quad&d_i\in f_{rd}(r),\;h=|f_{rd}(r)| \\ s.t.\quad&\forall i\in [1,h-1]\Rightarrow \gamma ^{k-1}[d_i]\geqslant \gamma ^{k-1}[d_{i+1}] \end{aligned} \end{aligned}$$
(9)

Similarly, the ranking \(\mu ^k_d\) of the relevance between all of the authors (reviewers) of paper d and the manuscript is represented by Eq. (10):

$$\begin{aligned} \begin{aligned}&\mu ^k_d=\{\gamma ^{k-1}[r_1],...,\gamma ^{k-1}[r_l]\} \\ where\quad&r_i\in f_{dr}(d),\;l=|f_{dr}(d)| \\ s.t.\quad&\forall i\in [1,l-1]\Rightarrow \gamma ^{k-1}[r_i]\geqslant \gamma ^{k-1}[r_{i+1}] \end{aligned} \end{aligned}$$
(10)
(11)

To ensure the stability of the iteration, we must normalize the accumulated relevance of all of the reviewer’s papers. Before normalization, the most relevant paper of a reviewer is assigned a weight of \(\eta \), and the reviewer’s remaining papers share a total weight of \(1-\eta -(1-\eta )^h\), assigned recursively; Eq. (11) shows how the function \(\pi \) assigns the largest weights to the papers that are most relevant to the target manuscript. In the kth iteration, the relevance \(\pi ^k[f_{rd}(r)]\) of reviewer r’s papers to the manuscript and the relevance \(\pi ^k[f_{dr}(d)]\) of the authors (reviewers) of paper d to the manuscript are expressed as Eq. (12):

$$\begin{aligned} \begin{aligned} \pi ^k[f_{rd}(r)]=\sum _{i=0}^{h-1}\frac{{(1-\eta )^i\eta }}{1-(1-\eta )^h}\mu ^k_{r,i+1} \\ \pi ^k[f_{dr}(d)]=\sum _{i=0}^{l-1}\frac{{(1-\eta )^i\eta }}{1-(1-\eta )^l}\mu ^k_{d,i+1} \end{aligned} \end{aligned}$$
(12)

where \(h=|f_{rd}(r)|\) denotes the number of papers authored by reviewer r, \(\eta \) denotes the weighting factor, and \(l=|f_{dr}(d)|\) denotes the number of authors (reviewers) of the reviewer’s paper d. Further, \(\mu ^k_{r,i+1}\) denotes the \((i+1)\)-ranked relevance score in \(\mu ^k_r\).
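The sketch below illustrates one side of the coupled iteration, combining Eqs. (8), (9), and (12): reviewer scores \(\gamma ^k[r]\) are updated from the ranked scores of their papers via the geometric weights of Eq. (12). The symmetric update of paper scores from reviewer scores is analogous. The initial scores, the reviewer-paper map, and the function names are toy assumptions, not the authors’ code.

```python
def pi_weighted(scores, eta=0.25):
    """Eq. (12): normalized geometric weighting of the ranked relevance scores."""
    ranked = sorted(scores, reverse=True)              # mu^k_r from Eq. (9)
    h = len(ranked)
    norm = 1.0 - (1.0 - eta) ** h
    return sum(((1.0 - eta) ** i) * eta / norm * ranked[i] for i in range(h))


def update_reviewer_scores(gamma_r, gamma_d, reviewer_papers, xi_d=0.05, eta=0.25):
    """Eq. (8), reviewer side: gamma^k[r] = (1 - xi_d) * gamma^{k-1}[r] + xi_d * pi^k[f_rd(r)]."""
    return {r: (1.0 - xi_d) * gamma_r[r]
               + xi_d * pi_weighted([gamma_d[d] for d in papers], eta)
            for r, papers in reviewer_papers.items()}


# Toy example: gamma^0 values (assumed to come from tNDCG * LM) and the map f_rd.
gamma_r0 = {"r1": 0.30, "r2": 0.10}
gamma_d0 = {"d1": 0.80, "d2": 0.05, "d3": 0.40}
f_rd = {"r1": ["d1", "d2", "d3"], "r2": ["d3"]}
print(update_reviewer_scores(gamma_r0, gamma_d0, f_rd))
```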

Fig. 1
figure 1

An example of WSIM reducing the interference in the assignment from nonmanuscript-related papers in the reviewer data. a The relationship schema between the reviewer and the reviewer’s paper; b the iterative process based on the relationship schema

Figure 1a depicts the relationship schema between the reviewer and the reviewer’s paper, resulting in \(f_{rd}(r)\) and \(f_{dr}(d)\). Figure 1b depicts an example of the iterative process based on the relationship schema. In this example, the reviewer’s papers \(\{d_1,d_2,d_3\}\) influence reviewer \(r_1\) through \(\xi _d\cdot \pi ^k[f_{rd}(r_1)]\), and the reviewers \(\{r_1,r_2\}\) influence the reviewer’s paper \(d_3\) through \(\xi _r\cdot \pi ^k[f_{dr}(d_3)]\), all of which together form a coupled random walk.

Algorithm 1 The main process of the WSIM

The topN reviewers assigned to the same manuscript are not ranked against one another. We therefore average the relevance of the reviewers’ papers before iteration so that the papers of the topN reviewers do not depend on this ranking. First, we compute the ranking of \(\gamma ^0[d]\) for each paper. Then, we process all \(\gamma ^0[d]\) values according to the reviewers’ average number of papers and the number of reviewers to be assigned to the target manuscript, and we average the relevance of the reviewers’ papers. Algorithm 1 describes the main process of the WSIM.

Thus, we use a ranking-based approach and iterative model to consider the constraints of the reviewer assignment problem and to obtain the most suitable topN reviewers for each manuscript.

4 Experiments

In this section, we evaluate the effectiveness of our WSIM method. We construct experiments using closed-world settings (Price and Flach 2017) with a fixed predetermined pool of reviewers to conduct a comparison with seven existing methods.

4.1 Dataset

Typically, journals do not disclose their specific manuscript review process for fairness and privacy reasons, so it is difficult to obtain real review records. This makes it difficult to use existing real datasets. For example, the dataset of Karimzadehgan et al. (2008) is too small and lacks time information. The dataset of Tang et al. (2012) lacks reviewer paper information and assignment results. The datasets of Kou et al. (2015a) lack the assignment results needed for evaluation. Mimno and McCallum (2007) provide a manually assigned dataset for NIPS 2006, but it is not publicly available. Therefore, we used two data sources to construct real datasets. All datasets are released on GitHub.

4.1.1 First dataset

Table 2 describes the dataset in detail. This dataset consists of reviewer profiles, which comprise their publications (including titles, abstracts, and years) and labels. Each label is a binary value indicating whether the reviewer can review the target manuscript, i.e., the peer review relationship between the target manuscript and the reviewer.

Table 2 First dataset

We apply a rule to obtain the labels from the fields (classifications) of reviewers and manuscripts: a reviewer who has published at least 10 papers in the field of the manuscript is eligible to review that manuscript. In this setup, each reviewer has at least one field (whether or not it matches a target manuscript) in which that reviewer has published at least 10 papers, which qualifies them as a candidate reviewer.
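A minimal sketch of this labeling rule is shown below; the 10-paper threshold follows the text, while the data layout (a per-reviewer field-count dictionary) is an assumption for illustration.

```python
def label_pair(reviewer_field_counts, manuscript_field, threshold=10):
    """Return 1 if the reviewer has published >= threshold papers in the manuscript's field."""
    # reviewer_field_counts: dict mapping field name -> number of the reviewer's papers in that field
    return 1 if reviewer_field_counts.get(manuscript_field, 0) >= threshold else 0


print(label_pair({"information retrieval": 12, "databases": 3}, "information retrieval"))  # 1
print(label_pair({"information retrieval": 12, "databases": 3}, "databases"))              # 0
```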

4.1.2 Second dataset

This dataset comes from the public arXiv data source, which contains a total of 1,180,081 papers. All papers contain titles, abstracts, authors, publication time, and subject, while 1,031,734 papers lack an MSC classification. We use the subject as the field. As with the first dataset, we constrain the information of reviewers and manuscripts during preprocessing; the difference is that a reviewer must have published at least 20 papers in the field of the manuscript to be eligible to review it. Table 3 describes the details of the dataset. We finally obtain 1885 reviewers and 685 manuscripts from the second dataset, which simulates a medium-sized conference.

Table 3 Second dataset

4.1.3 Validation dataset

To find a common set of hyperparameters, we constructed a validation dataset using the same methods and data sources as the first dataset. Table 4 describes the dataset in detail.

Table 4 Validation dataset

4.2 Comparison methods

We compare our WSIM with the following seven methods, which include classic algorithms and state-of-the-art algorithms: LDA (equivalent to the author-topic model) (Mimno and McCallum 2007), LM (Charlin and Zemel 2013), LDA-LM (Tang et al. 2010), TATB (time-aware and topic-based model) (Peng et al. 2017), KCS (keyword cosine similarity) (Protasiewicz et al. 2016), BBA (Kou et al. 2015a), and WMD (Kusner et al. 2015).

LDA This method calculates the cosine similarity of the topic distribution probability between the reviewer and the manuscript to determine the appropriate reviewers for the manuscript.

LM The field of the manuscript is regarded as a query term; the method calculates the probability that the query term is present in the reviewer’s information to obtain the appropriate reviewers for the manuscript (see Eq. (3)).

LDA-LM This approach combines the results of LDA and LM to determine the appropriate reviewers for the manuscript based on the total score.

TATB Based on LDA, the papers published by reviewers are assigned different weights over time and multiplied by the results of TF-IDF to determine the appropriate reviewers based on the resulting scores.

KCS This method uses the Kea algorithm to extract the keywords of the reviewers and target manuscripts, assigns weights to the keywords with respect to the publication time of the paper in which the keyword is located and calculates the cosine similarity between the reviewer and the target manuscript.

BBA This approach uses LDA to obtain the topic distribution of the reviewers and the target manuscripts. The topic distribution of all of the reviewers for a target manuscript is considered as a whole (a group of reviewers), and the branch-and-bound method is used to quickly determine the appropriate reviewers.

WMD This approach uses word2vec to calculate the word embedding of the reviewers and the target manuscripts and then uses earth mover’s distance to calculate the similarity between the text excerpts.

4.2.1 Hyperparameters

We perform a random search (Bergstra and Bengio 2012) in the hyperparameter space using the validation dataset, with the following results. The hyperparameters of the LDA model in the WSIM and the comparison methods include the number of fields (topics) K, the hyperparameter \(\alpha \), the hyperparameter \(\beta \), and the number of iterations, which are set to 50, 0.5, 0.1, and 3000, respectively. The hyperparameters of the WSIM include t, \(\eta \), \(\xi _d\), and \(\xi _r\), which are set to 80, 0.25, 0.05, and 0.05, respectively. The WMD uses 300-dimensional word embeddings. The hyperparameters of the other comparison methods are consistent with their original papers. Our implementations are available on GitHub.
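For reference, a random-search loop over the WSIM hyperparameters on the validation set could look like the sketch below. The search ranges and the evaluate_wsim callback are placeholders, not the exact procedure or ranges used in this paper.

```python
import random

# Hypothetical search ranges; evaluate_wsim(params) is assumed to return validation precision.
SPACE = {"t": list(range(20, 201, 10)),
         "eta": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
         "xi_d": [0.01, 0.05, 0.10, 0.20],
         "xi_r": [0.01, 0.05, 0.10, 0.20]}


def random_search(evaluate_wsim, n_trials=50, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in SPACE.items()}
        score = evaluate_wsim(params)          # e.g., precision@20 on the validation dataset
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```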

4.3 Evaluation metrics

We use the methods to find the topN reviewers for each manuscript and compare the result of each method with the labels in the dataset. We use the precision, recall, and F1 score as evaluation metrics. We also employ several popular information retrieval measures (Büttcher and Clarke 2016) including mean averaged precision (MAP), normalized discounted cumulative gain (NDCG), and bpref (Buckley and Voorhees 2004). The metrics are defined below:

$$\begin{aligned} \begin{aligned} \text {Precision:}~\text {P}&=\frac{1}{N}\sum ^{N}_{1}\frac{TP}{TP+FP} \\ \text {Recall:}~\text {R}&=\frac{1}{N}\sum ^{N}_{1}\frac{TP}{TP+FN} \\ \text {Macro-F1~score:}~\text {F}_1&=\frac{2\cdot \text {P}\cdot \text {R}}{\text {P}+\text {R}}\\ \end{aligned} \\ \begin{aligned} \text {Mean Average Precision:}~\text {MAP}&=\frac{1}{N}\sum ^{N}_{1}\frac{1}{R_n}\sum ^{R_n}_{i=1}(\text {P}@i\cdot R(c_i)) \\ \text {NDCG}&=\frac{1}{N}\sum ^{N}_{1}\frac{\sum _{i=1}^n\frac{R(c_i)}{\mathrm{log}_2(i+1)}}{\sum _{i=1}^n\frac{1}{\mathrm{log}_2(i+1)}} \\ \text {Binary preference:}~\text {bpref}&=\frac{1}{N}\sum ^{N}_{1}\frac{1}{R_n}\sum ^{R_n}_{r=1}(1-\frac{\sum _{i=1}^r(1-R(c_i))}{R_n}) \end{aligned} \end{aligned}$$

where \(N=|\mathbf {P}|\), TP denotes the number of true positives, FP denotes the number of false positives, and FN denotes the number of false negatives. In addition, \(n=topN\), \(R_n\) is the number of reviewers who are eligible to review the target manuscript, and \(R(c_i)=1\) if the i-th retrieved candidate is relevant to the target manuscript and \(R(c_i)=0\) otherwise.
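The sketch below gives simplified per-manuscript versions of these metrics (precision@n, average precision, NDCG@n, and bpref) for a single ranked candidate list; averaging over the N manuscripts is then a simple mean. The binary relevance list is a toy assumption, and minor normalization details may differ from the exact formulas above.

```python
import math


def precision_at_n(rel):                       # rel: binary relevance of the topN retrieved candidates
    return sum(rel) / len(rel)


def average_precision(rel, r_n):               # r_n: number of reviewers eligible for the manuscript
    hits, ap = 0, 0.0
    for i, r in enumerate(rel, start=1):
        if r:
            hits += 1
            ap += hits / i                     # P@i accumulated only at relevant positions
    return ap / r_n


def ndcg_at_n(rel):
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rel, start=1))
    norm = sum(1.0 / math.log2(i + 1) for i in range(1, len(rel) + 1))
    return dcg / norm


def bpref(rel, r_n):
    nonrel_seen, score = 0, 0.0
    for r in rel:
        if r:
            score += 1.0 - nonrel_seen / r_n   # penalize nonrelevant candidates ranked earlier
        else:
            nonrel_seen += 1
    return score / r_n


rel = [1, 0, 1, 1, 0]                          # toy relevance judgments for topN = 5
print(precision_at_n(rel), average_precision(rel, r_n=4), ndcg_at_n(rel), bpref(rel, r_n=4))
```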

4.4 Experimental results

We examined the performance of the WSIM and the comparison methods for topN values of 10, 20, 30, and 50. Tables 5 and 6 show the results of the WSIM and the seven comparison methods with respect to the precision, recall, F1 score, MAP, NDCG, and bpref metrics on the first and second datasets. The results indicate that our proposed WSIM is superior to all of the comparison methods, including the latest RRAP method (TATB). This demonstrates the effectiveness of our method in overcoming the aforementioned challenges. The WSIM outperforms the three types of methods it builds on, namely, the LM (word-based), LDA (semantic-based), and LDA-LM (word and semantic-based). The WSIM also outperforms the other methods: KCS (word-based), BBA (semantic-based), TATB (word and semantic-based), and WMD (word embedding or semantic-based). This is because we consider the constraints of the RAP rather than treating it merely as an information retrieval problem.

Some analyses of the comparison methods follow: (1) The performance of LDA is weaker than that of the LM, which is better suited to short texts (titles and abstracts). (2) The direct combination of LDA and the LM does not improve performance (Table 5) because the two components are not complementary, and the combination can be adversely affected by incorrect results from either one. (3) TATB uses the TF-IDF method but does not suitably represent the word information. (4) KCS uses keyword weight calculations but does not represent semantic information. (5) BBA uses topic-based coverage to calculate relevance without considering word information, and this coverage only finds an appropriate reviewer group for the target manuscript; it does not ensure that each individual reviewer is appropriate for the target manuscript. (6) WMD uses semantic information but does not consider the constraints of the RAP.

Table 5 Method performance comparison for the first dataset
Table 6 Method performance comparison for the second dataset

4.5 Ablation analysis

We conduct an ablation analysis of the WSIM to examine the effectiveness of each component: the improved LM, the ranking-based approach, and the iterative model. First, for the improved LM, we compare the existing LM with its improved version (I-LM). Second, for the ranking-based approach, we compare the existing LDA method with its improved version, LDA-NDCG. The original LDA method uses cosine similarity, and we additionally compare Euclidean distance (LDA-ED) and Jensen-Shannon divergence (LDA-JS). Finally, for the iterative model, we compare the existing LDA-LM with the improved WSIM, including LDA-NDCG+improved LM (LDA-NDCG+I-LM), which is the WSIM with zero iterations. Tables 7 and 8 show the precision of these methods on the first and second datasets, respectively, for topN values of 10, 20, 30, and 50. Underlining indicates the best result for the current component, and bold font indicates the best result in the current column. We make the following observations:

(1) The iterative model improves performance. The WSIM exceeds LDA-LM in all results and exceeds LDA-NDCG+I-LM in 75% of them, because the iterative model reduces the interference in the assignment from nonmanuscript-related papers in the reviewer data. (2) The ranking-based approach improves performance. LDA-NDCG outperforms LDA, LDA-ED, and LDA-JS because the ranking-based approach reduces the influence of inaccurate topic probability values. (3) The improved LM is helpful: I-LM outperforms the LM in 75% of the results, mainly because I-LM alleviates the problem caused by inconsistent text lengths in the LM.

Table 7 Experimental results of the original methods and their improved versions (first dataset)
Table 8 Experimental results of the original methods and their improved versions (second dataset)

We also explore the influence of the number of iterations on performance. Figure 2 shows the precision (topN=20) for different numbers of iterations. Zero iterations means that the interference in the assignment from nonmanuscript-related papers is not considered. On the first dataset, increasing the number of iterations from one to two improves performance because more iterations emphasize the papers that are highly similar to the manuscript. On the second dataset, a single iteration yields the best performance. As the number of iterations increases, the proportion of \(\pi ^k[f_{rd}(r)]\) in \(\gamma ^k[r]\) also increases. \(\pi ^k[f_{rd}(r)]\) highlights the importance of the papers that are highly similar to the manuscript, whereas \(\gamma ^0[r]\) reflects the importance of all of the reviewer’s papers, and both are indispensable. Therefore, continuing to increase the number of iterations can overstate the importance of the highly similar papers and degrade the final performance.

Fig. 2
figure 2

Precision achieved with a varied number of iterations

4.6 Significance test

In this subsection, we analyze the statistical significance of the performance improvement through a significance test. We randomly divide all manuscripts into ten folds (tenfold cross-validation) and compare the precision of the WSIM with that of all comparison methods through a two-sided paired t-test (Smucker et al. 2007). Table 9 shows the mean, t-value, and p-value of the precision (topN=20) on the two datasets. The first row is a paired t-test of the WSIM against itself, which, as expected, shows no difference. The confidence level of the WSIM’s improvement over the other methods is at least 97.5% on both datasets. This shows that the performance improvement of the WSIM is statistically significant.
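A minimal example of such a two-sided paired t-test, comparing the per-fold precision of two methods with SciPy, is shown below; the precision values are illustrative, not results from the paper.

```python
from scipy.stats import ttest_rel

# Per-fold precision@20 of two methods over tenfold cross-validation (toy values).
method_a = [0.52, 0.49, 0.51, 0.53, 0.50, 0.48, 0.52, 0.51, 0.50, 0.49]
method_b = [0.47, 0.45, 0.48, 0.49, 0.46, 0.44, 0.47, 0.46, 0.45, 0.46]

t_stat, p_value = ttest_rel(method_a, method_b)   # two-sided paired t-test
print(t_stat, p_value)                            # a small p-value indicates a significant difference
```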

Table 9 Two-sided paired t-test results between the WSIM and other comparison methods

For a more comprehensive analysis of the statistical significance between the performance of all methods, we extended the results presented in Table 9 to all methods and all metrics. Table 10 shows the results. The upper part of Table 10 shows whether each method (column) passes the significance test of outperforming each method (row) on six metrics (in the order of P, R, \(\text {F}_1\), MAP, NDCG, and bpref). We set the confidence level at 97.5% and record 1 if it passes the significance test at this confidence level; otherwise, we record 0. For example, “101111” means that only the recall R does not pass the significance test. The bottom half of Table 10 shows the lowest value of the confidence level among the six metrics. For example, in the first dataset, the lowest confidence level at which WSIM outperforms LDA+LM is 0.9517, which is the confidence level of recall R, as seen in the upper part of Table 10. We can obtain the performance ranking between different methods: \(\text {WSIM}>\text {LDA+LM}\ge \text {LM}\ge \text {WMD}>\text {TATB}\ge \text {LDA}>\text {BBA}\ge \text {KCS}\). From Table 10, we can see that this ranking’s confidence level is at least 95%.

Table 10 Two-sided paired t-test results between all methods

4.7 Bias-variance decomposition

In this subsection, we analyze the generalizability of all methods by performing a bias-variance decomposition of the per-manuscript precision. We use precision = 1.0 as the true output and calculate the bias between it and each method’s precision; the generalization error is the square of the bias plus the variance. Table 11 shows the bias, variance, and generalization error of each method (topN=20) on both datasets. The bias and generalization error of the WSIM are the smallest, and its variance is almost identical to that of LDA-LM and the LM on which it is based. This demonstrates the strong generalization capability of the WSIM, which reduces the bias while maintaining the variance.
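This decomposition can be reproduced with a few lines, as sketched below: treating precision = 1.0 as the target output, the bias is the mean deviation of the per-manuscript precision from the target, and the generalization error is the squared bias plus the variance. The per-manuscript precision values are illustrative.

```python
import numpy as np


def bias_variance(per_manuscript_precision, target=1.0):
    p = np.asarray(per_manuscript_precision, dtype=float)
    bias = target - p.mean()                      # mean deviation from the ideal precision
    variance = p.var()                            # variance of the per-manuscript precision
    return bias, variance, bias ** 2 + variance   # generalization error = bias^2 + variance


print(bias_variance([0.6, 0.45, 0.5, 0.55, 0.4]))
```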

Table 11 Bias, variance, generalization error (GE) of the WSIM and comparison methods on two datasets

To further determine whether the improvement comes from a few manuscripts or from many, we examine the precision bias between each method and the WSIM on each manuscript. The violin plot in Fig. 3 shows the distribution of this precision bias. Each violin shows the precision bias of one method on each manuscript and is labeled with the maximum, minimum, mean, and median of the precision (topN=20); the wider the violin, the more manuscripts lie at that value. From the figure, we observe that (1) the median precision bias of each comparison method is below zero, and (2) the distribution of the precision bias is close to a normal distribution. This shows that our method improves the performance on most manuscripts and that the improvement is approximately normally distributed.

Fig. 3
figure 3

Precision bias between all methods and the WSIM on each manuscript

4.8 Hyperparameter analysis

In this subsection, we investigate the impact of different values of the important WSIM hyperparameters on performance. We use the control-variable method to analyze the four most important parameters (\(t,\eta ,\xi _d,\xi _r\)) of the WSIM. Table 12 shows the six values tested for each hyperparameter; the values given in Sect. 4.2 are in bold. When one hyperparameter is varied, the other parameters are fixed at their boldface values. Figure 4 shows the experimental results of the WSIM over four hyperparameters, six hyperparameter values, six metrics, and two datasets. The horizontal coordinate shows the hyperparameter value, and the vertical coordinate shows the performance at that value minus the average performance over the six values. From the range of the vertical coordinates, we draw the following conclusions: (1) the influence of the hyperparameters on performance can be ordered as \(t>\eta>\xi _d>\xi _r\); (2) the influence of the hyperparameters \(\eta ,\xi _d,\xi _r\) on performance is less than 0.7%; and (3) when \(t>110\), its influence on performance tends to be stable.

Table 12 The values of four different hyperparameters in the WSIM
Fig. 4
figure 4

Performance of the four hyperparameters of the WSIM under different values

4.9 Case study

In this subsection, we provide a case study analysis to show the effectiveness of the WSIM with respect to the experimental evaluation and illustrate the practicality of the method.

To illustrate the effectiveness of the WSIM, we show the matching results of two manuscripts (\(p_1\), \(p_2\)) from the first dataset. To be representative, we chose two test samples whose single-sample precision approximates the overall evaluation result (50.25%); the precision (topN=20) of these two manuscripts is 50% and 55%, respectively. Among the reviewers recommended for each manuscript, we focus on five reviewers that include matching errors to show that the WSIM is more effective than the evaluation metrics suggest. We display the manuscript and reviewer information using the titles and related fields (classifications) of the papers; each reviewer’s title and related fields are taken from the reviewer’s paper that is most similar to the target manuscript.

Table 13 shows the matching results for the two manuscripts, where the fields are represented by symbols, and Table 14 gives the field names corresponding to the symbols. Among the five reviewers matched to manuscript \(p_1\), reviewers \(\{r_{11},r_{12},r_{13},r_{14}\}\) are truly suitable. Among the five reviewers matched to manuscript \(p_2\), reviewers \(\{r_{21},r_{22}\}\) are truly suitable. The reviewers corresponding to matching errors are nevertheless largely qualified to review the target manuscript. This is because the ground truth of the evaluation metrics is strict, and relying only on the labels disqualifies some suitable reviewers.

This analysis shows that the WSIM is more effective and practical than the evaluation metrics alone suggest.

Table 13 The matching results for the two manuscripts
Table 14 Notions of fields

5 Conclusions

We proposed an approach named the word and semantic-based iterative model (WSIM) to solve the retrieval-based reviewer assignment problem (RRAP). The WSIM determines the most appropriate reviewers for a target manuscript using a combination of word information and semantic information and considering the constraints of the RAP by improving the similarity calculations between reviewers and manuscripts. We reduce overfitting to incomplete reviewer data and the interference in the assignment from nonmanuscript-related papers in the reviewer data with a ranking-based approach and iterative model. We compare our approach with seven existing methods in closed-world settings, and the experimental results validate the effectiveness of our method.

The RAP includes the retrieval-based RAP, which we address in this paper, and the assignment-based RAP, which requires different strategies for different requirements (O’Dell et al. 2005) and is an interesting problem for future research. In the future, we also plan to provide an efficient system based on our proposed method for use by journals and conferences. In addition, we plan to explore how our methods can be applied to other research topics, such as information retrieval and question answering.