1 Introduction

Every year the number of new scientific articles increases. By now, researchers can no longer keep track of all newly published papers, even in their own fields. To address this, paper recommender systems have been developed that help researchers find papers to read or cite. Beel et al. have found over 200 research articles published since 1999 that deal with paper recommender systems [1].

However, the task of recommending papers differs depending on the target group: An experienced researcher might want recommendations with high serendipity, while another might be interested in finding related literature for a new paper. To the best of our knowledge, no recommender system exists that provides a scientist with an overview of a scientific field. The target group of such a system includes students who have just started working on their PhDs as well as grant program managers and review panel members looking into unfamiliar research fields. Such a recommender system could therefore support scientific communities.

But before such a recommender system can be developed the requirements must be analyzed: What characterizes papers that provide an overview of a scientific field? And how can we measure these criteria? In this paper we describe new and existing measures that can characterize sets of papers. Furthermore, we use these measures in a study to determine how papers that give an overview of a scientific field can be characterized.

In the next section of the paper we describe related work. Afterward, the characterization measures are given. In Sect. 4 the aforementioned study is described. The following section analyzes the characteristics of papers giving an overview of a field using the results of the study. The paper closes with a conclusion and remarks on future work.

2 Related Work

Scientific communities are constantly evolving and changing. As such, keeping track of a community can be a challenging task. Computer tools can support scientific communities by aiding the understanding of a field and its community. This can be achieved by identifying key papers and authors as well as emerging research fronts. These tasks are addressed by the Action Science Explorer tool [2]. It visualizes scientific papers and their citations and displays information on the papers on demand. Among the displayable information are citation contexts and automatically generated summaries of papers. Furthermore, it helps a user in understanding a field by providing various network analysis measures and plotting options.

Another option to support scientific communities is to provide researchers with paper recommendations. Although a vast number of paper recommender systems exist, it is unknown which recommender system is the best. One problem is that no gold standard exists against which new systems can be compared, which hinders the comparability of systems. Additionally, many reported results cannot be reproduced due to insufficiently described algorithms or flaws in the evaluations. One such reported flaw lies in the limited use of evaluation metrics: Most paper recommender systems are only evaluated with respect to the accuracy of the recommendations [1]. However, it has been shown that other factors also play an important role, e.g. the diversity of recommendations [9].

In the next section we describe various existing and new measures to characterize papers. Our aim is to understand the characteristics of paper sets that provide an overview of a scientific field. In the future we want to use the results and measures to develop a system that recommends such papers. However, the measures can also be used to evaluate and characterize paper recommender systems for other target groups, e.g. recommending serendipitous papers to experienced researchers. As such, by providing a set of measures we hope to help with the problem of evaluation flaws in paper recommender systems.

3 Measures

For measuring the characteristics of papers, measures can be applied to sets of papers or individual papers. Among the possible measurements for sets of papers are topic diversity, the breadth and depth of the covered topics as well as the extent to which all important subtopics are covered by the papers, i.e. coverage. Other measures might consider the coherence of the scientific papers. While the ranking of recommendations usually plays an important role in recommender systems, all measures presented here treat the papers in a set as unordered.

Each individual paper can also be characterized with regard to various features: These can consider the breadth and depth of the covered subtopics within a single paper, the diversity within the paper or in how far the paper is a representative of a scientific research line, i.e. in how far it is prototypical. Moreover, they might take the length or the type of the publication – technical report, conference paper, journal paper – or the comprehensibility of the paper into account.

Measures for set diversity and set coverage have been defined before. These will be described in the following subsections. Moreover, we define additional measures for set diversity, set coverage and paper prototypicality that are also described in this section. Some measures described in this paper use the citation network. This is a directed, acyclic graph G(V, E) in which papers are nodes (V) and citations are edges (E). An edge starts at the paper making a citation and ends at the referenced paper.

It should be noted that the measures used in this paper are specific for the case of scientific paper recommendations. The used data structures are geared to scientific papers, e.g. venues, abstracts, authors and citations. Apart from that, in other domains traits not considered here might be desirable, e.g. recommending products from different price ranges.

3.1 Set Diversity

According to Beel et al. [1] only two paper recommender systems take the diversity of a set of papers into account: Vellino [8] and Küçüktunç et al. [4].

Venue-Based Diversity. Vellino [8] considers diversity for the comparison of existing paper recommender systems. In these systems a user has to specify one scientific article of interest based on which recommendations are generated. For each recommended paper the journal distance between the journal the paper was published in and that of the input paper is computed. The journal distance is computed by using a large database with papers from several journals. Based on these distances the diversity of the set is calculated. The approach does not consider the distance of papers published at conferences. However, in many disciplines most papers are published at conferences – e.g. in computer science. Therefore, we will not use this measure.

Density-Based Diversity. Küçüktunç et al. [4] incorporate a diversification process into their system to make citation recommendations for scientific papers. The used diversity measure was developed by Tong et al. [7] to analyze the diversity of a set of nodes of a graph in general.

The diversity of a set of papers R is measured by using the l-density of the set in the underlying citation network, as given in (1). For this Küçüktunç et al. used \(l=2\).

$$\begin{aligned} dens_l(R) = \frac{\sum _{p_i,p_j \in R, p_i \ne p_j}d_l(p_i,p_j)}{|R| \cdot (|R|-1)} \end{aligned}$$
(1)

The l-density is similar to the normal density. The difference is that two nodes are considered to be connected if they are connected via a path of length at most l. This is expressed in (2) where \(dist(p_i,p_j)\) is the length of the shortest path between two nodes in the citation network. With regard to this shortest path it is unclear whether Küçüktunç et al. use the directed or undirected citation network. However, Tong et al. [7] use the diversity measure for undirected networks. Thus, in this paper the distance \(dist(p_i,p_j)\) is also calculated based on the undirected citation network.

$$\begin{aligned} d_l(p_i,p_j) = {\left\{ \begin{array}{ll} 1 & \text {if } dist(p_i,p_j) \le l\\ 0 & \text {otherwise} \end{array}\right. } \end{aligned}$$
(2)

Unfortunately, this density score is unintuitive as a diversity measure: The lower the score, the higher the diversity. To overcome this problem Tong et al. [7] invert the measure, which then lies in [0.5; 1]. We further normalize the measure to lie within [0; 1] as given in (3).

$$\begin{aligned} diversity_{density}(R) = (\frac{1}{1 + dens_l(R)} - 0.5) \cdot 2 \in [0;1] \subset \mathbb {R} \end{aligned}$$
(3)
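A minimal sketch of how (1)–(3) could be computed, assuming the undirected citation network is available as a networkx graph; the function name density_based_diversity is a placeholder introduced here for illustration.

```python
import itertools
import networkx as nx

def density_based_diversity(G, R, l=2):
    """Normalized density-based diversity of the paper set R in the
    undirected citation network G (Eqs. (1)-(3)); Kucuktunc et al. use l=2."""
    R = list(R)
    connected_pairs = 0
    for p_i, p_j in itertools.permutations(R, 2):
        try:
            if nx.shortest_path_length(G, p_i, p_j) <= l:   # d_l from Eq. (2)
                connected_pairs += 1
        except nx.NetworkXNoPath:
            pass
    dens = connected_pairs / (len(R) * (len(R) - 1))        # Eq. (1)
    return (1.0 / (1.0 + dens) - 0.5) * 2.0                 # Eq. (3)
```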

Author-Based Diversity. A new approach to calculate diversity looks at the authors of the recommended papers. If all papers have been written by the same set of authors, the diversity can be expected to be low. On the other hand if all papers have been written by completely different authors, a high diversity can be expected. This notion is used in the diversity measure based on authors given in (4) for a set of papers R. Hereby, \(author(p_i)\) returns the set of authors of a paper \(p_i\) and \(uniqueAuthors(p_i, R')\) (5) returns the percentage of authors of paper \(p_i\) that do not participate in any paper in the set of papers \(R'\).

$$\begin{aligned} diversity_{author}(R) = \frac{ \sum _{p_i \in R} { uniqueAuthors(p_i, R \setminus \{p_i\}) } }{ |R| } \in [0;1] \subset \mathbb {R} \end{aligned}$$
(4)
$$\begin{aligned} uniqueAuthors(p_i, R') = \frac{|\{ a ~|~ a \in author(p_i) \wedge a \not \in \bigcup _{p_j \in R'}{author(p_j)}\}|}{|author(p_i)|} \end{aligned}$$
(5)
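A minimal sketch of the author-based diversity (4)–(5); the accessor authors_of, which maps a paper to its set of (normalized) author names, is a placeholder and not part of the original measure definition.

```python
def author_based_diversity(R, authors_of):
    """Author-based diversity of the paper set R (Eqs. (4)-(5))."""
    def unique_authors(p_i, rest):
        # Fraction of p_i's authors not appearing on any paper in `rest` (Eq. (5)).
        other_authors = set().union(*(authors_of(p_j) for p_j in rest)) if rest else set()
        own_authors = authors_of(p_i)
        return len(own_authors - other_authors) / len(own_authors)

    R = list(R)
    return sum(unique_authors(p_i, [p_j for p_j in R if p_j != p_i])
               for p_i in R) / len(R)                        # Eq. (4)
```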

Similarity-Based Diversity. A different approach considers the topical similarity of the papers in a set. Ziegler et al. [9] and Jones [3] use the similarity of items to measure the diversity. Hereby, a higher similarity of items means a lower diversity. Ziegler et al. use the measure to analyze the topic diversity in commercial book recommendations. Jones uses it to analyze the user acceptance of commercial recommender systems.

Our calculation of diversity based on similarity differs slightly from that used by [9] and [3] in that a higher score indicates a higher diversity. This is shown in (6). For the topical similarity measure the topic structure similarity (tss) [5] is used. The tss similarity is a hybrid similarity measure that is a linear combination of a network-based and a content-based similarity. The diversity measure is divided by the maximum tss value – \(maxTSS = 2\) – in (6) to normalize it.

$$\begin{aligned} diversity_{similarity}(R) = 1 - \frac{\sum _{p_i, p_j \in R, p_i \ne p_j} tss(p_i, p_j)}{|R| \cdot (|R|-1) \cdot maxTSS} \in [0;1] \subset \mathbb {R} \end{aligned}$$
(6)
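A minimal sketch of (6); the topic structure similarity tss from [5] is assumed to be supplied by the caller, and MAX_TSS = 2 is its maximum value as stated above.

```python
import itertools

MAX_TSS = 2.0  # maximum value of the topic structure similarity

def similarity_based_diversity(R, tss):
    """Similarity-based diversity of the paper set R (Eq. (6))."""
    R = list(R)
    pair_sum = sum(tss(p_i, p_j) for p_i, p_j in itertools.permutations(R, 2))
    return 1.0 - pair_sum / (len(R) * (len(R) - 1) * MAX_TSS)
```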

The three diversity measures (3), (4) and (6) will be used in the remainder of this paper.

3.2 Set Coverage

Coverage is defined in various ways in different publications. In our understanding coverage is the extent to which all relevant subtopics are covered by the papers in the set R. As a proxy we propose to use the average pairwise distance of the recommended papers. Papers on the same topic should be close together while papers on different topics should be farther apart. A very large average distance indicates that most probably several topics are covered by the set of papers. A very small average distance on the other hand can indicate that only some subtopics are covered. Therefore, a moderate distance should be targeted. The coverage of a set of papers \(R \subseteq V\) can be measured as given in Eq. (7) where \(d(p_i, p_j)\) is the distance of two papers. In this context the standard deviation of these distances is of interest, too.

$$\begin{aligned} coverage(R) = \frac{\sum _{p_i, p_j \in R, p_i \ne p_j} d(p_i, p_j)}{|R| \cdot (|R|-1)} \end{aligned}$$
(7)

Depending on the distance measure, this measure can produce values larger than 1. With regard to the distance measure, three different calculations are considered: A structure-based, a similarity-based and a hybrid distance.

Structure-Based Distance. The structure-based distance is calculated as the length of the shortest path connecting the two papers in the undirected citation network. Let \(shortestPaths(p_i, p_j)\) return a set of shortest paths that connect the nodes \(p_i\) and \(p_j\) in the undirected network and let length(p) be a function that returns the number of edges that make up this path. Then the calculation of this distance measure is given in (8).

$$\begin{aligned} d_{structure}(p_i, p_j)= length(p) \text {, where } p \in shortestPaths(p_i, p_j) \end{aligned}$$
(8)

Note that due to the construction of the used citation networks – as explained later – the whole citation network is one weak component. Thus, \(\forall p_i,p_j \in V: d(p_i,p_j)<\infty \) holds. Further note that the paths connecting two nodes \(p_i\) and \(p_j \in R\) are not limited to the subgraph induced by the set of nodes R. The coverage measure using this distance can be normalized by dividing by the diameter of the undirected citation network.
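A minimal sketch of the coverage (7) with the structure-based distance (8), assuming the undirected citation network is a connected networkx graph; the standard deviation mentioned above is returned alongside the diameter-normalized coverage.

```python
import itertools
import statistics
import networkx as nx

def coverage_structure(G, R):
    """Coverage of the paper set R (Eq. (7)) using the shortest-path
    distance in the undirected citation network G (Eq. (8))."""
    R = list(R)
    dists = [nx.shortest_path_length(G, p_i, p_j)
             for p_i, p_j in itertools.permutations(R, 2)]
    coverage = sum(dists) / (len(R) * (len(R) - 1))
    coverage /= nx.diameter(G)           # normalization by the network diameter
    return coverage, statistics.pstdev(dists)
```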

Similarity-Based Distance. The similarity-based distance is calculated using the topic structure similarity (tss) [5]. As a higher similarity should imply a lower distance, the distance is calculated as given in (9). Hereby, the maximum value the tss similarity can reach is again denoted as maxTSS.

$$\begin{aligned} d_{similarity}(p_i, p_j) = 1- \frac{tss(p_i, p_j)}{maxTSS} \end{aligned}$$
(9)

Note that coverage(R) using \(d_{similarity}(p_i,p_j)\) is equal to \(diversity_{similarity}(R)\) (6). At first glance it may seem contradictory that a measure for diversity could also be used as an indicator for coverage. However, both concepts are related. If various papers cover different aspects of a topic, they have a high coverage. Likewise, they also have a high diversity. Similarly, if all papers in a set cover the same aspect of a topic, they have a low coverage. At the same time, they also have a high similarity and therefore low diversity.

Hybrid Distance. The third distance is a hybrid of the previously mentioned distances. Let each of the shortest paths be encoded as the list of nodes along it in their natural order and let p[i] denote the ith node on a path p. Furthermore, let length(path) return the number of edges that make up this path. The hybrid distance measure is given in (10). The shortest path is again determined based on the undirected citation network. If multiple shortest paths exist between two papers, the one with the shortest hybrid distance is to be taken.

$$\begin{aligned} \begin{aligned} d_{hybrid}(p_i, p_j)&= min\{d_{hybrid}(path) ~|~ path \in shortestPaths(p_i, p_j)\}\\ \text {with}\\ d_{hybrid}(path)&= \sum _{l=1}^{length(path)} (1 + maxTSS - tss(path[l], path[l+1]))\\ \end{aligned} \end{aligned}$$
(10)

The coverage measure using this distance can be normalized by dividing by the maximally possible value. This maximum is given as 3d where d is the diameter of the undirected citation network.
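A minimal sketch of the hybrid distance (10), assuming an undirected networkx citation graph and a caller-supplied tss similarity; MAX_TSS = 2 as before.

```python
import networkx as nx

MAX_TSS = 2.0

def hybrid_distance(G, p_i, p_j, tss):
    """Hybrid distance between two papers (Eq. (10)): the cheapest of all
    shortest paths, where each edge costs 1 + maxTSS - tss of its endpoints."""
    def path_cost(path):
        return sum(1.0 + MAX_TSS - tss(u, v) for u, v in zip(path, path[1:]))
    return min(path_cost(path)
               for path in nx.all_shortest_paths(G, p_i, p_j))
```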

3.3 Paper Prototypicality

Linked to the diversity of a set of papers is the prototypicality of each paper. Prototypicality measures in how far a paper is a prominent representative of a specific line of research. If the same group of authors has published several papers, the most recent one should be prototypical for all of them. This notion is used in (11) where the prototypicality of a paper depends on the number of papers in the whole network that were published by the same authors – regardless of the order of the authors on the paper – before this paper.

$$\begin{aligned} \begin{aligned} prototypicality_{age}(p_i) = |\{p_j |&p_j \in V \wedge authors(p_i) = authors(p_j) \\&\wedge year(p_i) > year(p_j)\}| \end{aligned} \end{aligned}$$
(11)

Hereby, \(authors(p_i)\) is a function that returns the set of authors of a paper \(p_i\). The function \(year(p_i)\) on the other hand returns the publication year of a paper \(p_i\). The idea behind the measure is that two papers by the same set of authors most probably fall within the same line of research and the younger paper builds on the older ones. The more previous papers exist, the higher this paper’s prototypicality.

On the other hand, a paper that represents a specific line of research should also have influenced many other papers. Therefore, it should have been cited often. Moreover, the citing papers should be similar in topic. This is encoded in (12). Hereby, tss is again the topic structure similarity and \(N_{in}(p_i)\) is the set of papers that cite paper \(p_i\).

$$\begin{aligned} prototypicality_{indegree}(p_i) = \sum _{p_j \in N_{in}(p_i)}{tss(p_j, p_i)} \end{aligned}$$
(12)

The venue of a publication also influences its prototypicality. For instance a journal paper should be more prototypical than a conference paper. This is encoded in (13).

$$\begin{aligned} prototypicality_{venue}(p_i) = {\left\{ \begin{array}{ll} 0.5 & \text {if } venue(p_i) = \text {Technical Report}\\ 1 & \text {if } venue(p_i) = \text {Conference Proc.}\\ 1.5 & \text {if } venue(p_i) = \text {Journal}\\ 0 & \text {otherwise} \end{array}\right. } \end{aligned}$$
(13)

All of these three measures are combined in (14).

$$\begin{aligned} \begin{aligned} prototypicality(p_i) =&prototypicality_{venue}(p_i) \cdot \\&(prototypicality_{age}(p_i) + prototypicality_{indegree}(p_i)) \end{aligned} \end{aligned}$$
(14)

The measure for prototypicality can be normalized by dividing by the maximally possible prototypicality of a citation network. This maximum is equal to \(1.5 \cdot ((|V|-1) + (maxIn \cdot maxTSS))\) where maxIn is the maximum in-degree in the network.
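A minimal sketch of (11)–(14) together with the normalization described above, assuming a directed networkx citation graph G (edges point from the citing to the cited paper), per-paper metadata mappings authors, year and venue, and a caller-supplied tss; all of these accessor names are placeholders introduced for illustration.

```python
MAX_TSS = 2.0

VENUE_WEIGHT = {"Technical Report": 0.5,     # Eq. (13)
                "Conference Proc.": 1.0,
                "Journal": 1.5}

def prototypicality(G, p_i, authors, year, venue, tss):
    """Combined prototypicality of paper p_i (Eq. (14))."""
    # Eq. (11): older papers in the network by exactly the same author set
    age = sum(1 for p_j in G.nodes
              if p_j != p_i
              and authors[p_j] == authors[p_i]
              and year[p_j] < year[p_i])
    # Eq. (12): topical similarity of the papers citing p_i
    indegree = sum(tss(p_j, p_i) for p_j in G.predecessors(p_i))
    return VENUE_WEIGHT.get(venue[p_i], 0.0) * (age + indegree)

def normalized_prototypicality(G, p_i, authors, year, venue, tss):
    """Prototypicality divided by the maximum possible value in the network."""
    max_in = max(d for _, d in G.in_degree())
    maximum = 1.5 * ((G.number_of_nodes() - 1) + max_in * MAX_TSS)
    return prototypicality(G, p_i, authors, year, venue, tss) / maximum
```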

The prototypicality of papers and the diversity of a set of papers are connected. If each paper in a set has a high prototypicality, most probably each paper represents a different line of research – otherwise the prototypicality values would be low for some papers in the set. This set of papers should therefore also have a high diversity. The opposite is not true in general. Imagine ten papers from completely different research groups on different topics. Moreover, let each of these papers be the first in their respective lines of research. Then the set of these papers is very diverse, while the individual papers are not prototypical.

4 Study

To determine the characteristics of papers that give an overview of a scientific topic, we conducted a study. In this study experts manually picked such papers from one of their areas of expertise. These selected papers were then characterized using the measures defined in Sect. 3.

In a pre-study each expert’s area of expertise was determined. Additionally, the experts named three papers from their area of expertise. Based on this information, a list of papers for the chosen topic was created for each expert which was used in the study.

4.1 Hypotheses

The set of papers that provide an overview of a scientific topic should highlight the different subtopics and aspects of the topic. Thus, we expect a high set diversity. The selections are expected to be a trade-off between the number of represented aspects of a topic on the one hand and the depth to which they are explored on the other hand; the former aspect is captured by the coverage. Given this trade-off, we expect moderately high coverage values. We also expect all papers to be more or less similar to one another. Therefore, the average distance as calculated by the coverage is expected to have a low standard deviation. Having a set of papers from the same line of research most probably will not give a comprehensive overview of a scientific topic. Therefore, in a set of papers that give an introduction to a scientific topic we expect each paper to have a high prototypicality on average.

4.2 Pre-study

Thirteen scientific experts from computer science – eleven PhD students, one postdoctoral researcher and one professor – participated in the pre-study. In the actual study experts were asked to select papers from a list that give an overview of a specific scientific field for a starting PhD student. To compile this list for each expert, the experts were asked in a pre-study to state an area of their expertise and name three English papers from that area that had been published in the years 2010 to 2013. The lower bound on the publication year was to ensure that only fairly new papers were named. The upper bound was given by the used dataset. Furthermore, the three papers together should fulfill three further criteria, whereby each paper should meet at least one criterion:

  • The expert is an author of the paper, therefore ensuring their expertise

  • The paper is known in the scientific community

  • The paper is a survey paper

These papers were then used to generate a list of papers for each expert, as described in the following.

4.3 Dataset

These lists of papers were generated by extracting citation networks from the ArnetMiner dataset [6]. This dataset contains papers and citations obtained from DBLP and is thus mostly limited to computer science papers. For each expert one of the three named papers was chosen – called the seed – for which the 2.5 neighborhood network was extracted. The 2.5 neighborhood network of the seed is the graph that consists of all papers connected to the seed via an undirected path of length at most two and the corresponding edges (2 neighborhood). Moreover, any citations among these papers are included as well (0.5 neighborhood). The result is a citation network.
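A minimal sketch of this extraction step, assuming the full citation graph is available as a directed networkx graph; the function name is a placeholder.

```python
import networkx as nx

def neighborhood_2_5(G, seed):
    """2.5 neighborhood network of `seed`: all papers within undirected
    distance 2 of the seed, plus every citation among those papers."""
    undirected = G.to_undirected()
    nodes = set(nx.single_source_shortest_path_length(undirected, seed, cutoff=2))
    return G.subgraph(nodes).copy()
```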

Unfortunately, most papers named by the experts were not present in the dataset or had no or only one edge. These papers had to be discarded. Overall six participants – four PhD students, one postdoc and one professor – named one or more papers that could be used for the study. In case more than one appropriate paper had been named, the one that produced the largest and densest citation network was chosen. The papers chosen for the study each met one or more of the three criteria given in the pre-study. Two of the experts – later referred to as experts two and three – had the same dataset.

The extracted citation networks were very large – in one case the network contained more than 3500 papers. As the experts were expected to manually select ten papers out of the citation network, this was not feasible. Hence, it was decided to reduce the citation networks by applying the k-core algorithm. The k-core of a graph is a subgraph in which each node has at least a degree of k. For this degree count the undirected citation network was considered. For each expert’s citation network the minimal k was selected such that the resulting k-core consisted of maximally 100 nodes. 100 was chosen since we deem it a small enough number of papers to be given to the experts. The k-core of expert six is an exception and contains more than 100 nodes since a higher k resulted in an empty graph.
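A minimal sketch of this reduction step, assuming the citation network is a networkx graph; as described above, the smallest k whose k-core contains at most 100 nodes is chosen, and the previous core is kept if the next one would be empty (the case of expert six).

```python
import networkx as nx

def reduce_by_k_core(citation_network, max_nodes=100):
    """Return the k-core used in the study together with the chosen k."""
    undirected = nx.Graph(citation_network)  # degrees counted on the undirected network
    k = 1
    core = nx.k_core(undirected, k)
    while core.number_of_nodes() > max_nodes:
        next_core = nx.k_core(undirected, k + 1)
        if next_core.number_of_nodes() == 0:
            break                            # a higher k would yield an empty graph
        core, k = next_core, k + 1
    return core, k
```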

Table 1. The statistics of the complete citation networks (CN) and the selected k-cores (KC) used in the study; |V|: number of nodes, |E|: number of edges

Table 1 shows the statistics of the citation networks and the selected k-cores for each expert. For experts 2 and 3 the same dataset was used, as described above. The undirected diameter – i.e. the longest shortest path in the undirected network – is always equal to four for the complete citation network. This is due to the construction using the 2.5 neighborhood of the seed paper. It can be seen that the diameters of both the directed and undirected networks are in most cases smaller in the k-cores compared to the complete citation networks. Thus, the k-cores are far more densely connected than the complete citation networks. Therefore, the datasets used in the study are biased with regard to their density.

For most papers the abstract is missing in the ArnetMiner dataset. Therefore, the abstracts were added manually for the papers in the k-cores. Duplicates were removed from the k-cores. Examples for duplicates are pre-published and published versions of the same paper. That way five papers were deleted from the k-core of expert one’s dataset, four from the dataset of experts two and three, two papers from the dataset of expert four, five from the dataset of expert five and one paper from the dataset of expert six. In all cases except that of expert six the seed papers were – coincidentally – not included in the selected k-cores.

4.4 Study and Questionnaire

Each expert was given a list with the papers contained in the respective k-core. For each paper the list contained information on the title, authors, publication year, venue and abstract. From these lists the experts were asked to select ten papers that provide a PhD student who has just started working on their thesis with an overview of the scientific topic named by the expert in the pre-study. These papers should together cover all or most main aspects of the scientific topic. Moreover, each paper should cover at least one aspect of the topic of interest and it should be seen as an important paper by the community. In case the topic named in the pre-study was too broad to be sufficiently covered by ten papers or in case the papers covered a slightly different topic than the intended one, the experts were given the option to choose a new topic for which they selected the ten papers.

Additionally, the experts were asked to fill out a questionnaire after selecting the papers. In this they were asked how important various criteria were for their selection on a 3-point Likert scale (Not Important, Moderately Important, Very Important). These criteria were diversity of the set of papers, coverage of the set of papers, depth of the topics covered by the papers, coherence of the set of papers and average prototypicality of each paper.

4.5 Remarks

During the study expert three deviated from the original topic given in the pre-study by selecting a subtopic. Thus, although experts two and three were given the same dataset, they selected papers for different topics and their results cannot be compared to one another.

Furthermore, it should be noted that in the case of expert five the list of papers given mostly covered a different topic than the one chosen by the expert. The originally chosen topic was Automata Theory from theoretical computer science. However, the list of papers given in the study mostly dealt with Binary Decision Diagrams (BDDs). The seed paper used to generate the list of papers cites the paper that introduced BDDs in passing. Because of that, a large portion of the citation network deals with BDDs, which happens to be the selected k-core. The expert therefore selected papers that serve as an introduction to the topic of BDDs. However, this topic does not fall within the researcher’s area of expertise. Thus, the selections made by expert five have to be interpreted with caution.

5 Results

The datasets used in the survey along with the anonymized expert selections are available online. Unfortunately, the dataset does not include the venue type of a paper. Therefore, in the following it is always assumed that \(prototypicality_{venue}(p_i) = 1\) (cf. (13)) holds. To eliminate duplicate papers from the dataset – as described in Sect. 4.3 – the venue types of specific papers were looked up manually.

With regard to the author-based diversity measure entity resolution can be problematic. The same author can occur under different names in different papers, e.g. “John Smith”, “J. Smith” or “John D. Smith”. To solve this problem we converted each author’s name to the first letter of the first given name and the last name. Moreover, we converted the names to lower case. For example both “J. Smith” and “John D. Smith” were converted to “j. smith”. As our datasets are very small and focus on a single research area the likelihood that two different authors map to the same standardized name is very small.
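A minimal sketch of this normalization; the function name is a placeholder, and the rule is exactly the one described above (first initial plus last name, lower-cased).

```python
def normalize_author(name):
    """Map e.g. 'John D. Smith' and 'J. Smith' to the same key 'j. smith'."""
    parts = name.strip().split()
    return f"{parts[0][0].lower()}. {parts[-1].lower()}"
```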

As will be seen, expert four’s selection behaves differently from the other experts’ selections when compared to the random samples with regard to nearly all of the calculated measures. Thus, expert four might be an outlier. This might be explained by the relatively small dataset of only 30 papers given to this expert in the study.

To test the suitability of the measures, for each dataset ten random paper sets were selected. Each of these random sets consisted of ten papers – the same size as the expert selections. For these random selections the different measures were also calculated and compared to the values of the expert selections. The results of the study partly confirm our expectations (cf. Section 4.1). In the following the results of the different measures are presented and discussed.
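A minimal sketch of this comparison, assuming a measure(paper_set) function and the list of candidate papers from a k-core; the one-sample two-tailed t-test used for the significance marks reported below is taken to be SciPy's ttest_1samp, and all names here are placeholders.

```python
import random
from scipy import stats

def compare_to_random(measure, candidates, expert_selection,
                      n_samples=10, sample_size=10, seed=0):
    """Compare an expert selection against random selections of equal size."""
    rng = random.Random(seed)
    random_values = [measure(rng.sample(candidates, sample_size))
                     for _ in range(n_samples)]
    expert_value = measure(list(expert_selection))
    # One-sample two-tailed t-test of the random values against the expert value
    _, p_value = stats.ttest_1samp(random_values, expert_value)
    return expert_value, random_values, p_value
```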

5.1 Questionnaire

Table 2 shows the results of the questionnaire. Hereby the Likert scales have been translated to 1 (Very Important), 0 (Moderately Important) and -1 (Not Important).

Table 2. The results of the questionnaire for the set diversity (Div), set coverage (Cov), topic depth of the set (Dep), set coherence (Coh) and average prototypicality of each paper (Prot)

Regarding the importance of the different criteria the experts’ answers vary. The criterion regarded to be most important is set diversity. The second most important criteria are prototypicality of each paper and set coverage. The remaining two criteria are regarded as unimportant or moderately important. Therefore, the measures presented in this paper concern the most important criteria.

5.2 Diversity

Figure 1 depicts the values of the diversity measures for both the expert recommendations and the average of the random paper selections – along with the corresponding standard deviations. The cases for which the expert selections differ significantly from the random selections are marked with \(\dagger \) (p < 0.05) and \(\dagger \) \(\dagger \) (p < 0.01). This significance was measured using a one-sample two-tailed t-test. The selection by expert four behaves as an outlier for most of the measures.

Fig. 1. The values of the diversity measures

The density-based diversity measure was calculated for \(l=1\). Larger values of l were not tested as the underlying networks are k-cores of a 2.5 neighborhood citation network. Thus, any two nodes have a maximum distance of 4 in the complete, undirected citation network. Table 1 shows that the diameter in the undirected k-cores lies between two and four. Therefore, larger values of l are of little interest, as the resulting diversity would be 0.

All three diversity measures depict a clear trend: nearly all expert selections have a lower diversity than the random selections. Moreover, the expert selections receive moderately high diversity scores. For the author-based diversity the selection by expert three receives a low value of ca. 0.37. The ten papers selected by this expert have 18 distinct authors in total, out of which seven appear in more than one paper. Out of the ten papers, only one has authors that exclusively contributed to this paper within the selected ten papers. All other papers have overlaps in authorship with at least one of the other selected papers. Thus, a small group of authors appears to have contributed substantially to this specific scientific topic.

These findings stand partially in contrast to our hypothesis. We would have expected the expert-selected paper sets to have a high diversity score and in particular a higher diversity than the random paper selections. On second thought, however, a moderately high diversity makes sense: a very high diversity could easily be achieved by papers that cover completely different topics, which is not the target.

For instance in the case of the author-based diversity it is not surprising that a few scientists have shaped a scientific field in such a way that they are authors of more than one paper among the ten papers giving an overview to that field, while the majority of authors occurs in only one of the selected papers. With regard to the density-based diversity measure a moderately high density and therefore a moderately low diversity seems reasonable for papers that introduce a scientific field and various of its subtopics. The fact that the expert selections have a lower similarity-based diversity than the random selections means that the papers chosen by the experts are on average more similar to one another than the randomly selected papers.

Fig. 2. The values of the normalized coverage measures ((a)-(c)) and the corresponding standard deviations ((d)-(f))

5.3 Coverage

Figure 2 depicts the values of the normalized coverage measures (2(a) - 2(c)) for both the expert recommendations and the average of the random paper selections. As the coverage is calculated as the average distance, the standard deviations are shown, too (2(d) - 2(f)). The cases for which the expert selections differ significantly from the random selections are again marked with \(\dagger \) (p < 0.05) and \(\dagger \) \(\dagger \) (p < 0.01).

For all three variations of the coverage measure in most cases the expert selected papers have a lower coverage than the randomly selected papers. The exception is expert four’s value for the similarity-based coverage measure. However, this selection is not statistically significantly different from the random selections with regard to this measure. For all three measures all expert selections receive moderately high values. This is in accordance with the expectations. The standard deviations are in most cases not significantly different from those of the random selections. Both the expert and random selections have a small standard deviation. This is in accordance with the expectations.

Fig. 3. The values of the normalized prototypicality measure

5.4 Prototypicality

Figure 3 depicts the values of the normalized prototypicality measure for both the expert recommendations and the average of the random paper selections – along with the corresponding standard deviations. For the normalization the maximum value for the venue-based prototypicality was changed to one.

The cases for which the expert selections differ significantly from the random selections are again marked with \(\dagger \) (p < 0.05) and \(\dagger \) \(\dagger \) (p < 0.01). For the average prototypicality, the standard deviation shown is calculated over the per-set averages of the random selections. The value reported for the random selections is the mean of these per-set averages; since the random selections are of equal size, it is also the average over all randomly chosen papers. However, it should be noted that the different random samples overlap.

The papers chosen by the experts have on average a higher prototypicality than the randomly selected papers. This fits our expectations. However, this difference is statistically significant in only four of the six cases.

6 Conclusion and Future Work

In this paper we have presented some new measures – an author-based diversity measure, three variants to calculate the coverage of papers and a prototypicality measure – to characterize sets of papers or individual scientific papers. Moreover, we have adapted an existing measure to the domain of scientific papers: a similarity-based diversity measure. In a study experts were asked to select papers that provide an introduction to or overview of a scientific field. The measures were then applied to these expert-selected papers and used to characterize papers that provide an overview of a scientific field. The values of the different measures for the expert selections were also compared to the values for randomly selected papers.

The diversity measures show that the expert selected papers in general receive lower diversity scores than random paper selections. However, the expert selections receive moderately high diversity measures. This stands partially in contrast to our hypothesis but is actually reasonable. Similar to the diversity measures, all coverage measures found that the sets of papers selected by the experts most of the time had a lower coverage than the random selections. A moderately high coverage was found in all cases for all measure variants. The standard deviation was low. This is in accordance with our expectations. The average prototypicality of the expert chosen recommendations was higher than that of the random selections. This result also confirms our hypothesis.

All of these measures seem adequate to characterize paper sets that provide an overview of a scientific topic. However, the decision to use the k-core of a citation network as data may have influenced the results of the measures. Therefore, all characterization measures need to be further evaluated in future studies with more participants and different data sets. Additionally, we would like to use the characterization measures to develop a paper recommender system. This recommender system should be tailored to present a scientist with a set of scientific papers that give an overview of a scientific field. Such a system would be a valuable support for scientific communities.