1 Introduction

There has recently been interest in designing retrieval systems to rank documents with novelty and diversity: the retrieved documents should cover some set of subtopics or cover different possible intents of a query (Agrawal et al. 2009; Vee et al. 2008; Clarke et al. 2008; Radlinski et al. 2008; Chen and Karger 2006; Zhai et al. 2008; Carbonell and Goldstein 1998; Carterette and Chandar 2009; Clarke et al. 2009a). Various evaluation measures have been proposed for this task: Zhai et al. (2008) introduced variations of recall and precision that count the number of unique subtopics retrieved, Clarke et al. (2008) introduced a “nugget”-based version of DCG that penalizes systems for retrieving redundant subtopics, and Agrawal et al. proposed “intent-aware” versions of classical measures that average those measures calculated with respect to particular intents. In theory these measures can be used for optimization as well. They are based on a Cranfield-like setting in which assessors have annotated documents not only on their relevance but also with respect to subtopics, intents, or nuggets. The system is rewarded for finding documents that contain subtopics or nuggets that have not previously been seen in higher-ranked documents.

These measures have something in common: the computation needed to calculate them exactly is NP-hard (Agrawal et al. 2009; Clarke et al. 2008; Zhai et al. 2008). Let \(\mathcal{S}\) be a set of subtopics, intents, nuggets, or facets related to a given query Q, and let \(\mathcal{C}\) be a corpus of documents in which each document \(\mathcal{D}\) contains zero or more elements of \(\mathcal{S}\). Documents that contain zero elements are nonrelevant. Apart from the intent-aware measures, those listed above are based on comparing a value calculated over subtopics retrieved up to some rank j to the maximum value that could have been retrieved at the same rank. Finding this maximum is generally an NP-hard problem. As a result, specific decisions made in the design of a novelty/diversity retrieval system may appear to lead to worse results by these measures even when those same decisions would actually improve the experience of the user the measures intend to model.

This paper is presented in two parts. The first considers the worst-case implications of optimizing to and evaluating with NP-hard effectiveness measures. The second uses simulations to draw conclusions about the implications in the average case.

2 Worst-case analysis

Let us first define our evaluation measures using the notation above, then show how each is NP-hard. For simplicity we will refer to elements of \(\mathcal{S}\) as subtopics, though they need not literally be subtopics.

2.1 Evaluation measures

We consider four measures from the literature: S-recall and S-precision, α-nDCG, and intent-aware precision (prec-IA). Before defining them, let us follow Zhai et al. (2008) in defining \({\sc minRank}(\mathcal{S}, k)\) as the size of the smallest subset of documents in \(\mathcal{C}\) that could contain (“cover”) at least k subtopics in \(\mathcal{S}\). We will use unadorned minRank for the case where \(k=|\mathcal{S}|\). We prove that computing minRank is NP-complete in the "Appendix" (Theorem 1).

2.1.1 S-recall

S-recall at rank m is defined as the number of subtopics retrieved up to a given rank m divided by the total number of subtopics (size of \(\mathcal{S}\) ) (Zhai et al. 2008):

$$ S{\hbox {-}}recall@m = \frac{|\cup_{i=1}^m {\mathcal{D}}_i|}{|{\mathcal{S}}|}. $$

Computing S-recall at an arbitrary m takes polynomial time; we need only count the unique subtopics retrieved. But because \(|\mathcal{S}|\) could vary greatly from topic to topic, it is useful to look at S-recall at rank \(m={\sc minRank}(\mathcal{S},|\mathcal{S}|)\). Analogously to R-precision, S-recall at minRank has a minimum value of 0 and a maximum of 1 for every topic. Computing it, however, is NP-complete as a consequence of minRank being so.
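To make the definition concrete, the following sketch (our own code and naming, not part of any official evaluation script) computes S-recall@m when each document is represented simply as the set of subtopic identifiers it contains:

```python
def s_recall(ranking, all_subtopics, m):
    """S-recall@m: fraction of subtopics covered by the top-m documents.

    ranking: list of sets; ranking[i] holds the subtopics contained in
             the document at rank i+1.
    all_subtopics: the full subtopic set S for the topic.
    """
    covered = set()
    for doc in ranking[:m]:
        covered |= doc & all_subtopics
    return len(covered) / len(all_subtopics)
```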

2.1.2 S-precision

Zhai et al. (2008) defined S-precision at rank m as the ratio of the minimum rank at which a given recall value could optimally be achieved to the first rank at which the same recall value actually has been achieved. Let \(k=|\cup_{i=1}^m \mathcal{D}_i|\). S-precision is then equivalent to \({\sc minRank}(\mathcal{S}, k)\) divided by the first rank by which at least k unique subtopics have appeared.

$$ S{\hbox {-}}precision@m = \frac{{\sc minRank}({\mathcal{S}}, k)}{m^*}, \quad \hbox {where } m^{*} = \min\left\{ j : \left|\cup_{i=1}^j {\mathcal{D}}_i\right| \ge k \right\}. $$
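A corresponding sketch for S-precision (again our own code; the min_rank argument stands in for \({\sc minRank}(\mathcal{S}, k)\), which is NP-hard to compute exactly and is approximated greedily in Sect. 2.2):

```python
def s_precision(ranking, all_subtopics, m, min_rank):
    """S-precision@m = minRank(S, k) / m*, where k is the number of unique
    subtopics in the top m documents and m* is the first rank at which
    k unique subtopics have been retrieved.

    min_rank: a function mapping k to (an estimate of) minRank(S, k).
    """
    k = len(set().union(*ranking[:m]) & all_subtopics)
    if k == 0:
        return 0.0
    seen = set()
    for j, doc in enumerate(ranking, start=1):
        seen |= doc & all_subtopics
        if len(seen) >= k:
            return min_rank(k) / j   # j is m*
```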

2.1.3 α-nDCG

Standard DCG calculates a gain for each document based on its relevance and a logarithmic discount for the rank at which it appears (Järvelin and Kekäläinen 2002). The nugget version for diversity evaluation defines the gain of a document in terms of the subtopics (or nuggets) it contains and the frequency with which those subtopics appear in documents ranked above it (Clarke et al. 2008). The gain is incremented by 1 for each new subtopic, and by \(\alpha^k\) (\(0 \le \alpha \le 1\)) for a subtopic that has already been seen k times in previously-ranked documents.
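A sketch of the gain computation (our code; we use the \(\log_2(\mathrm{rank}+1)\) discount that also appears in the worked example of Sect. 2.3):

```python
import math

def alpha_dcg(ranking, alpha=0.5, depth=None):
    """alpha-DCG: a subtopic seen k times in higher-ranked documents
    contributes alpha**k to the gain of the current document (so a new
    subtopic contributes 1); gains are discounted by log2(rank + 1).
    """
    seen = {}                       # times each subtopic has appeared so far
    score = 0.0
    for rank, doc in enumerate(ranking[:depth], start=1):
        gain = sum(alpha ** seen.get(s, 0) for s in doc)
        score += gain / math.log2(rank + 1)
        for s in doc:
            seen[s] = seen.get(s, 0) + 1
    return score
```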

Since DCG is unbounded, it is standard to normalize it by the maximum possible value it could have given a perfect ranking of documents; this is called nDCG. In the case of α-DCG, determining that maximum appears to be NP-hard. Though the argument is not straightforward, we present a sketch in the "Appendix" (Conjecture 1).

Clarke et al. have also introduced a variant of α-nDCG called novelty- and rank-biased precision (NRBP) that is based on Moffat and Zobel’s rank-biased precision (Moffat and Zobel 2008; Clarke et al. 2009b). Rather than use an exact normalization factor, it normalizes using an upper bound on the maximum possible NRBP, calculated by assuming there is an “ideal” ranking in which every document contains every subtopic. Because of this, NRBP does not have a well-defined range. Note that this is not necessarily detrimental (DCG does not have a well-defined range either); the practical question is whether it affects conclusions drawn from evaluation or whether it has any effect on the way we optimize system performance.

2.1.4 Intent-aware precision

Intent-aware precision (prec-IA) is calculated by first calculating precision for each distinct subtopic separately, then averaging these precisions according to some distribution indicating the proportion of users that are interested in that subtopic. Using the notation we defined above, this may be expressed as:

$$ \begin{aligned} prec{\hbox {-}}IA@m &= \sum_{S \in {\mathcal{S}}} P(S | Q) prec_S@m \\ &= \sum_{S\in {\mathcal{S}}} P(S|Q) \frac{1}{m}\sum_{i=1}^m I(S \in {\mathcal{D}}_i) \\ \end{aligned} $$

where \(I(S \in \mathcal{D}_i)\) is 1 if and only if subtopic S appears in document \(\mathcal{D}_i\), and P(S|Q) is the probability that a user issuing query Q would be interested in subtopic S. Intent-aware measures do not penalize redundancy, but using a weighted average ensures that more desirable subtopics will influence the final value to a greater degree than less desirable subtopics.
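The same document representation gives a direct implementation of prec-IA (our code; subtopic_probs encodes P(S|Q) and would be uniform in the example of Sect. 2.3):

```python
def prec_ia(ranking, subtopic_probs, m):
    """Intent-aware precision at rank m.

    ranking: list of sets of subtopic identifiers, one per rank.
    subtopic_probs: dict mapping each subtopic S to P(S|Q).
    """
    total = 0.0
    for s, p_s in subtopic_probs.items():
        prec_s = sum(1 for doc in ranking[:m] if s in doc) / m
        total += p_s * prec_s
    return total
```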

Prec-IA is efficiently computable. As with α-DCG and NRBP, its maximum achievable value for a query is not necessarily 1.0, nor is it necessarily even clear what that maximum is; it depends on the distribution of subtopics in documents (see Theorem 2 in the "Appendix"). However, a normalizing constant for prec-IA can be computed using a simple greedy algorithm (see Theorem 3 in the "Appendix"). Thus prec-IA is efficiently computable whether it is normalized or not.

We note that it is possible to define intent-aware versions of S-recall, S-precision, and α-nDCG. This might be valuable in cases where the query is ambiguous, so there are multiple possible intents, and each intent has its own set of subtopics or nuggets, creating a sort of hierarchy of subtopics. Clarke et al. (2009b) consider this case in their definition of intent-aware NRBP. For simplicity, we will focus on a single level of that hierarchy.

All of these measures have strengths; each contributes something unique to an overall understanding of performance. Our concern is not with the measures themselves, but with the cases at their boundaries: those topics for which we cannot properly evaluate or optimize systems because of the computational requirements. These cases cannot be averaged out; they will be a source of systemic error in our evaluations. Our goal is to begin to estimate how frequent such cases may be and what the implications of their existence are.

2.2 Approximability

An approximation algorithm is an efficiently-computable algorithm that gives an approximate solution to a hard problem. Approximation algorithms are typically evaluated by an approximation ratio: a bound on the ratio of the approximate solution's value to the optimal solution's value, usually expressed as a function of the input size.

2.2.1 Evaluation

There is a simple greedy algorithm for calculating \({\sc minRank}(\mathcal{S}, k)\) and the normalizing factor in α-nDCG: first take the document that contains the most subtopics, then the document that contains the most subtopics that have not already been taken, and so on until k subtopics have been covered. This greedy approach is in fact roughly the best approximation that can be achieved. As we show in the "Appendix", minRank is equivalent to Set Cover, and Feige (1998) showed that Set Cover is inapproximable within \((1-\epsilon)\ln |\mathcal{S}|\) for \(\epsilon > 0\) unless NP has quasi-polynomial algorithms. The greedy algorithm has approximation ratio \(H_s\), where s is the largest number of subtopics contained in any single document and \(H_n = \sum_{i=1}^n 1/i\); the fact that \(H_s \le 1+\ln s\) gives the result.
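The greedy procedure can be written in a few lines; this sketch (our code) returns the size of the greedy cover, i.e. the greedy estimate of \({\sc minRank}(\mathcal{S}, k)\):

```python
def greedy_min_rank(docs, all_subtopics, k=None):
    """Greedy approximation of minRank(S, k): repeatedly take the document
    that covers the most not-yet-covered subtopics until k are covered.

    docs: list of sets of subtopic identifiers (one set per document).
    Returns the number of documents selected, an upper bound on minRank.
    """
    if k is None:
        k = len(all_subtopics)
    covered, selected = set(), 0
    remaining = list(docs)
    while len(covered) < k and remaining:
        best = max(remaining, key=lambda d: len((d & all_subtopics) - covered))
        if not (best & all_subtopics) - covered:
            break                           # nothing new can be covered
        covered |= best & all_subtopics
        remaining.remove(best)
        selected += 1
    return selected
```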

While the approximated minRank or normalizing factor can therefore be quite bad, the situation is somewhat better for the measures themselves. The measures exhibit submodularity, which means they can be approximated within a constant factor of 1 − 1/e (Agrawal et al. 2009). Intuitively, even if we are overestimating the denominator by a large factor, the fact that there is a limited number of subtopics means that the marginal error in the approximate value of S-recall or S-precision decreases as that factor increases.

2.2.2 Optimization

The optimization problem is to rank documents such that S-recall, S-precision, α-nDCG, or prec-IA are maximized. The standard principle for optimization in IR is the Probability Ranking Principle, which says that ranking documents in decreasing order of probability of relevance gives the optimal expected precision and recall (and therefore R-precision and average precision and other such measures) (Robertson 1977). It can be extended to graded relevance to provide a ranking principle for DCG (Li et al. 2008). Either way, the PRP assumes that documents are relevant independently of one another, so it is not suitable for optimization of novelty or diversity rankings (Goffman 1964). Robertson illustrates this with an example of a query with two possible intents, showing that there is no PRP-based ranking that can uniformly satisfy both intents (Robertson 1977).

An optimization analog to the greedy algorithm for approximating evaluation measures is a greedy algorithm for ranking documents: given k ranked documents, the (k + 1)st should be the one that is most likely to satisfy the greatest number of previously-unsatisfied subtopics (Agrawal et al. 2009; Clarke et al. 2008; Zhai et al. 2008). However, unlike the PRP, which maximizes precision and recall at every rank, a greedy document-by-document ranking principle cannot necessarily provide maximum S-recall, S-precision, or α-nDCG at every rank. This follows from the NP-completeness of the evaluation problem: if it were possible to optimize at every rank, the evaluation measures could be computed exactly with the greedy algorithm. The worst case for optimization, then, is that the system is optimized at rank \(1+\log |\mathcal{S}|\) but not at any higher rank.

Intent-aware precision is an important exception. An expanded PRP that estimates the probability of relevance of a document to each subtopic would optimize prec-IA at every rank. This is because prec-IA, in contrast to the other measures, does not explicitly penalize redundancy. We explore the consequences of this below.

2.3 Example

Suppose there are 14 subtopics and 5 relevant documents (that is, five documents that contain at least one subtopic). Documents contain subtopics as follows:

$$ \begin{aligned} \mathcal{D}_1 &= \{ S_1, S_2 \} \\ \mathcal{D}_2 &= \{ S_3, S_4, S_5, S_6 \} \\ \mathcal{D}_3 &= \{ S_7, S_8, S_9, S_{10}, S_{11}, S_{12}, S_{13}, S_{14} \} \\ \mathcal{D}_4 &= \{ S_1, S_3, S_4, S_7, S_8, S_9, S_{10} \} \\ \mathcal{D}_5 &= \{ S_2, S_5, S_6, S_{11}, S_{12}, S_{13}, S_{14} \} \end{aligned} $$

Let us consider each of our evaluation measures:

  1. To calculate minRank, the greedy algorithm will take \(\mathcal{D}_3\) followed by \(\mathcal{D}_2\) followed by \(\mathcal{D}_1\), resulting in S-recall being evaluated at rank 3. The optimal is at rank 2; \(\mathcal{D}_4\) and \(\mathcal{D}_5\) cover all 14 subtopics. The approximation ratio of minRank is therefore 3/2 (a code sketch verifying this appears after the list).

  2. S-precision at any rank depends on being able to calculate \({\sc minRank}(\mathcal{S}, k)\), where k is the number of unique subtopics observed to that rank. For k = 7 and k = 8, the greedy and optimal algorithms agree that \({\sc minRank}(\mathcal{S}, 7) = {\sc minRank}(\mathcal{S}, 8) = 1\). They also agree for k = 12 (the first two documents selected by the greedy algorithm): \({\sc minRank}(\mathcal{S}, 12) = 2\). But for k = 14 (the two documents selected by the optimal algorithm) there is disagreement: the greedy approach says \({\sc minRank}(\mathcal{S}, 14) = 3\), while the optimal says \({\sc minRank}(\mathcal{S}, 14) = 2\). This means that calculating \({\sc minRank}(\mathcal{S}, 14)\) greedily for a system that places \(\mathcal{D}_4, \mathcal{D}_5\) at ranks 1 and 2 will result in an S-precision of 3/2, which is greater than 1.

  3. The normalizing factor for α-nDCG presents a problem in that the optimal set of documents over which it is computed can depend on the rank. At rank 1, the best possible α-DCG is achieved with \(\mathcal{D}_3\) (α-DCG = \(8/\log_2(2)\)). But at rank 2, the best possible α-DCG is achieved with \(\mathcal{D}_4,\mathcal{D}_5\) (α-DCG = \(7/\log_2(2)+7/\log_2(3)\)). The optimal set at rank 1 is not a subset of the optimal set at rank 2, and therefore optimal α-nDCG at every rank is unachievable by any ranking algorithm.

  4. Assuming P(S|Q) is uniform, prec-IA is maximized by taking \(\mathcal{D}_3\) first, then \(\mathcal{D}_4\) and \(\mathcal{D}_5\). Note that no matter what rank we look at, despite the fact that we can find the maximum value, prec-IA is rather far from 1: prec-IA@1 = 0.57, prec-IA@2 = 0.54, prec-IA@3 = 0.52.
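The minRank figures above can be checked mechanically. The sketch below (our code) encodes the example, reuses greedy_min_rank from Sect. 2.2, and finds the true minRank by brute force, which is feasible at this scale:

```python
from itertools import combinations

docs = {
    "D1": {1, 2},
    "D2": {3, 4, 5, 6},
    "D3": {7, 8, 9, 10, 11, 12, 13, 14},
    "D4": {1, 3, 4, 7, 8, 9, 10},
    "D5": {2, 5, 6, 11, 12, 13, 14},
}
S = set(range(1, 15))

def true_min_rank(doc_sets, subtopics):
    """Exact minRank by exhaustive search over document subsets."""
    for size in range(1, len(doc_sets) + 1):
        for combo in combinations(doc_sets, size):
            if set().union(*combo) >= subtopics:
                return size

print(greedy_min_rank(list(docs.values()), S))   # 3 (D3, then D2, then D1)
print(true_min_rank(list(docs.values()), S))     # 2 (D4 and D5)
```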

Now let us consider how the two types of evaluation interact with greedy optimization versus optimizing for S-recall at minRank. We will assume the system has perfect knowledge of subtopics, and consider two cases:

  1. a system optimizing S-recall/S-precision, greedily taking \(\mathcal{D}_3, \mathcal{D}_2, \mathcal{D}_1\) followed by \(\mathcal{D}_4, \mathcal{D}_5\) in any order to maximize the number of unique subtopics retrieved;

  2. a system optimizing α-nDCG/prec-IA, greedily taking \(\mathcal{D}_3, \mathcal{D}_4, \mathcal{D}_5, \mathcal{D}_2, \mathcal{D}_1\) to provide some redundancy along with new subtopics.

The first of these greedy approaches is illustrated in Fig. 1, along with the optimal ranking for S-recall at minRank and the minRanks calculated by greedy and optimal approaches.

Fig. 1 A system that ranks documents greedily to optimize S-recall would place \(\mathcal{D}_3\) above \(\mathcal{D}_2\) above \(\mathcal{D}_1\). A system that ranks documents greedily to optimize α-nDCG would place \(\mathcal{D}_3\) above \(\mathcal{D}_4\) and \(\mathcal{D}_5\) (not shown). A system that optimizes S-recall at \({\sc minRank}(\mathcal{S})\) would place \(\mathcal{D}_4,\mathcal{D}_5\) at the first two positions. Using a greedy algorithm to determine \({\sc minRank}(\mathcal{S})\) places it at rank 3; the true value is at rank 2

Table 1 shows the complete set of evaluations for the three systems: the greedy systems with greedy evaluation, the greedy systems with optimal evaluation, the optimal system with greedy evaluation, and the optimal system with optimal evaluation. Note that some of the values are greater than one for the optimal system evaluated greedily; this is because it is simply able to outperform any greedy algorithm. Also note that the optimal system is uniformly outperformed at rank 1 by the greedy systems regardless of how the evaluation measures are computed; this is because, as mentioned above, the optimal document at rank one (\(\mathcal{D}_3\)) is not among the documents that are optimal at rank two (\(\mathcal{D}_4,\mathcal{D}_5\)). Since the document a system places at rank 1 must also appear among its top two documents, it cannot optimize at both ranks and therefore must suffer at one of them.

Table 1 Greedy and optimal evaluations for two systems that rank documents greedily and a system that optimizes for S-recall at the minimum rank

The α-nDCG case is particularly interesting. We calculated α-nDCG with α = 1/2, i.e. the second time a subtopic appears it contributes 1/2 to the document’s gain, the third time it contributes 1/4, and so on. The system that optimizes S-recall therefore has an incentive to go on to find the second-best set of documents and rank them next, thereby achieving an α-nDCG greater than 1 at rank 2 under the greedy evaluation. The greedy system evaluated optimally, on the other hand, sees a decrease in α-nDCG despite continuing to find novel subtopics; this is because it could have retrieved all 14 unique subtopics at rank 2, and 14 unique subtopics plus 8 redundant subtopics at rank 3.

Normalized prec-IA exhibits the most extreme behavior. For the system that greedily optimizes prec-IA, it is 1 throughout. For the system that greedily optimizes S-recall, prec-IA decreases with rank. For the system that optimizes true S-recall, prec-IA increases with rank. There is no other measure that produces such wide differences in behavior, and for that reason we question the applicability of prec-IA to this task.

NRBP is not shown in the table, since it is not calculated at individual ranks but rather calculated for an entire ranking. It also has no NP-complete component, because it uses an efficiently-computable upper bound to normalize. It provides an additional interesting case, however: the system that optimizes for S-recall at minRank has an NRBP of 0.711, while the greedy S-rec/S-prec system has an NRBP of 0.673 and the greedy α-nDCG/prec-IA system has an NRBP of 0.713. All three values are fairly far from 1.0, though there is no ranking that provides a higher NRBP. Like α-nDCG, NRBP will prefer a greedy system, though it comes closer to recognizing that a greedy ranking is not the only possible approach.

The table shows that for optimization there is a firmly imposed tradeoff. When optimizing for S-recall at minRank, it is impossible to achieve perfect S-recall, S-precision, α-nDCG, and prec-IA at rank 1. When optimizing greedily for S-recall/S-precision or α-nDCG/prec-IA at each rank, it is impossible to achieve perfect S-recall at minRank. In standard retrieval problems founded on the PRP, there is an empirical tradeoff between precision and recall, but it is theoretically possible to optimize for both. For these measures there may be topics for which that is theoretically impossible; the developer is forced to choose.

This example can be generalized. Suppose \(|\mathcal{S}| = 2^{k+1}-2\) and there are k pairwise-disjoint relevant documents, where \(\mathcal{D}_i\) contains \(2^i\) subtopics, plus two additional relevant documents that are disjoint from each other and that each contain one half of each \(\mathcal{D}_i\). The approximation ratio for minRank is then k/2. As k increases, the greedily-computed S-recall for a greedy system is 1, but the true S-recall is \((2^k+2^{k-1})/(2^{k+1}-2)\), which goes to 3/4. Note that this is a constant approximation ratio for S-recall despite the logarithmic approximation ratio for minRank. This is due to the submodularity of S-recall (Agrawal et al. 2009).
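The family described above can also be generated programmatically. This sketch (our code, reusing the greedy_min_rank and true_min_rank functions from the earlier sketches) builds the instance for a given k; the greedy cover has size k while the true minRank stays at 2:

```python
def generalized_example(k):
    """k pairwise-disjoint documents, D_i containing 2**i subtopics, plus
    two disjoint documents that each contain half of every D_i.
    |S| = 2**(k+1) - 2."""
    docs, next_id, half_a, half_b = [], 0, set(), set()
    for i in range(1, k + 1):
        block = list(range(next_id, next_id + 2 ** i))
        next_id += 2 ** i
        docs.append(set(block))
        half_a |= set(block[: len(block) // 2])
        half_b |= set(block[len(block) // 2:])
    docs += [half_a, half_b]
    return docs, set(range(next_id))

docs, S = generalized_example(5)
print(greedy_min_rank(docs, S), true_min_rank(docs, S))   # 5 2
```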

3 Simulation and analysis

While worst-case analysis shows that it is possible to construct cases in which the evaluation and optimization fail, the practical question is whether such cases occur in real data, and if so, how often and to what extent they affect evaluation and optimization. Having only a small sample of subtopic queries to analyze and no theory regarding the distribution of subtopics in documents, we cannot make definitive statements. But we can run simulations of the type done in average-case complexity studies (Bogdanov and Trevisan 2006).

We report results exclusively for S-recall at minRank. S-recall is slightly simpler than S-precision and α-nDCG because it involves no parameters and is always between 0 and 1. The general conclusions hold regardless of measure.

3.1 Real data

There is little annotated data available for studying these problems. Currently two large sets exist. The first was constructed by Allan et al. (2005) for a report-writing task with a newswire corpus. It comprises a set of 60 topics with about 13,000 document-level relevance judgments as well as labeled “aspects” for each relevant document. “Aspects” are defined as individually distinct pieces of relevant information. For instance, the first query is “oil producing nations” and its relevant aspects are Algeria, Angola, Azerbaijan, Bahrain, Brazil, Cameroon, Chad, China, .... Each document is labeled as to whether it is relevant to each of the topic’s aspects. The aspects were defined by the assessors themselves during the course of judging. If while judging their 10th document they discovered it contained an aspect that had not been in any of the first nine, they added it to the list of aspects for that topic. There was no limit on the number of aspects they could define; the average for a topic is 22, but two topics have over 100.

The second was assembled by NIST for the diversity task for the TREC 2009 Web track. It comprises 50 topics with about 28,000 document-level relevance judgments to web pages, with each page judged for relevance with respect to predefined subtopics (of which there were at most eight) (Clarke et al. 2009a). Subtopics largely reflect different information needs or intents of the query. For example, the query “kcs” has two subtopics relating to the Kansas City Southern railroad, two relating to two separate school districts, and one relating to an energy company.

We obtained these datasets to use as starting points. For the Allan et al. data, we treat aspects as subtopics. We consider each subtopic to be equally valuable to the user, so this problem is somewhat different from the diversity problems of Agrawal et al. and others that model a user's interest in particular subtopics. The Web track data is closer to that diversity problem in that the subtopics are much more clearly delineated between documents. This emerges clearly when looking at the average number of subtopics documents are relevant to: for the former set, each document contains 2.7 subtopics on average; for the latter, each document contains an average of only 1.2.

Tables 2 and 3 show some example topics from the two sets along with their subtopics. Table 2 illustrates a task in which there is a clearly-defined information need, and to answer that need a system must retrieve as many unique aspects as possible. Table 3 illustrates a task in which there are multiple possible needs, and the system must be useful to users who have any of them (proportionately).

Table 2 Examples of topics from the Allan et al. data
Table 3 Examples of topics from the TREC 2009 Web track data

Among these 110 topics, there are seven that are trivial (two from Allan et al.; five from Web): they have only one relevant document, only one subtopic, or one relevant document that covers all the subtopics. We have excluded these. Additionally, there are 34 that are quasi-trivial (27 (46.5%) from Allan et al.; 7 (15%) from Web); in these, some subtopics only appear in one relevant document each, and taking those documents (and in some cases one additional document) covers the set trivially. There are seven topics for which the greedy algorithm overestimates the true minRank, with four from Allan et al. and three from Web. Therefore, 7 out of 103 non-trivial topics (6.8%) and 10% of non-quasi-trivial topics can have performance overestimated by the greedy algorithm.

3.2 Simulated topics

Starting from real topics, we simulate new topics by sampling from a space defined by the marginal distributions of subtopics within documents. Specifically, each topic can be written as a matrix T with documents on the rows and subtopics on the columns, where \(T_{ij} = 1\) if document i is relevant to subtopic j and \(T_{ij} = 0\) otherwise. An example is shown in Table 4. We will simulate topics by sampling uniformly at random from the space of 0–1 matrices that have the same row sums and column sums as the initial topic matrix. This ensures that even if we cannot precisely model the distribution of subtopics in documents, we can at least model the numbers of subtopics contained in each document and the number of documents each subtopic appears in.

Table 4 Part of the document-subtopic matrix for topic 18 “haiti protest”

The sampling algorithm is based on a random walk procedure described by Zaman and Simberloff (2002). It is used in ecological studies for statistical testing of hypotheses about distributions of species in regions. It is based on the observation that within a larger matrix T, a \(2\times 2\) diagonal submatrix \(\left[\begin{array}{cc} 1 & 0\\ 0 & 1 \end{array}\right]\) can be changed to an anti-diagonal submatrix \(\left[\begin{array}{cc} 0 & 1\\ 1 & 0 \end{array}\right]\) (and vice versa) without altering the row or column sums. The algorithm works by sampling two rows and two columns uniformly at random, and if the \(2\times 2\) matrix formed from the cells at their intersections is diagonal or anti-diagonal, changing it to an anti-diagonal or diagonal matrix (respectively). Over many iterations this randomizes the distribution of subtopics in documents while keeping the marginal sums constant.

The algorithm requires a “burn-in” period to sufficiently randomize the original matrix. After that, a large enough number of sampling iterations ensures a uniform distribution over all possible matrices with the same row and column sums as the original. We used a burn-in period of 10,000 iterations, followed by 1,000 additional iterations to generate each random topic. Thus for any given topic, we could generate a new random topic by iterating 1,000 times starting from the burned-in matrix for that topic.
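A sketch of the sampler (our code, using NumPy; the function is a direct transcription of the swap rule, and the two calls at the end mirror the burn-in and per-sample iteration counts given above):

```python
import numpy as np

def randomize_matrix(T, iterations, rng=None):
    """Random walk over 0-1 matrices with fixed row and column sums.

    Repeatedly pick two rows and two columns at random; if the 2x2
    submatrix at their intersection is diagonal or anti-diagonal, flip it
    to the other pattern, which preserves all marginal sums.
    """
    rng = rng or np.random.default_rng()
    T = T.copy()
    n_rows, n_cols = T.shape
    for _ in range(iterations):
        rows = rng.choice(n_rows, size=2, replace=False)
        cols = rng.choice(n_cols, size=2, replace=False)
        sub = T[np.ix_(rows, cols)]
        if sub[0, 0] == sub[1, 1] and sub[0, 1] == sub[1, 0] and sub[0, 0] != sub[0, 1]:
            T[np.ix_(rows, cols)] = 1 - sub
    return T

# burned = randomize_matrix(T, 10_000)       # burn-in
# topic  = randomize_matrix(burned, 1_000)   # one simulated topic
```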

We have limited our simulations to start from the Allan et al. data. Because there are substantially more subtopics, and the variance in the number of subtopics is higher, this data provides somewhat more interesting results.

3.2.1 Results

Results on simulated topics are based on evaluating a greedy system with perfect knowledge of subtopic containment. This is because the worst case for a system without perfect knowledge is arbitrarily bad: if such a system did not retrieve any relevant documents in the top j ranks, where j is the optimal minRank, but retrieved relevant documents at the following ranks up to the greedy minRank, its S-recall approximation ratio goes to infinity. We consider simulated imperfect systems in the next section.

First we investigated the probability that the greedy algorithm for minRank would overestimate the minimum rank. Figure 2 shows the proportion of sampled matrices, starting from each actual topic, for which the true minimum rank (found by exhaustive search) was less than the greedy minimum rank. Note that for some topics the probability is very high: for topic 60, the greedy algorithm was suboptimal on over half the randomly sampled matrices.

Fig. 2 Proportion of matrices sampled from the space defined by each of the baseline topics with minRank approximation ratio greater than 1

There were 19 topics (roughly one third) for which the greedy and true minimum rank matched in every sample. Overall, the greedy algorithm overestimated minRank for about 15% of sampled topics, which is a little higher than would be expected if the rate of 4 in every 60 observed in the real data were the true rate.

Next we investigated the average minRank approximation ratio for the cases in which the greedy algorithm was suboptimal. Figure 3 shows the results for the 39 topics that were not always greedy-optimal. Topic 7 is the worst, with an average approximation ratio of nearly 1.5 (minimum 1; maximum 1.667; median 1.333). Over all sampled topics, the mean approximation ratio is 1.16. The greedy minimum rank is never more than 4 ranks greater than the optimal, suggesting cases like our example above (worst case \(\log |\mathcal{S}|\)) are not occurring.

Fig. 3 Average minRank approximation ratio when greedy algorithm is suboptimal. Queries for which the greedy algorithm is always optimal not shown

Finally we looked at the factor by which S-recall was overestimated when the rank was overestimated. Again, S-recall can only be overestimated by a constant factor, a consequence of the \(1 - 1/e\) bound discussed above. Figure 4 shows that the average worst case is about 1.16 times the true value. The maximum factor by which any S-recall is overestimated is 1.33, which happens to be the reciprocal of the 3/4 approximation ratio derived in our example above.

Fig. 4 Average factor by which S-recall is overestimated when greedy algorithm is suboptimal. Queries for which the greedy algorithm is always optimal not shown

3.3 Simulated systems

As discussed above, the worst case for a system with perfect knowledge of subtopics is that S-recall is overestimated by a constant factor. The worst case for a system with no knowledge of subtopics (i.e. one that makes use of heuristics such as similarities between documents) is arbitrarily bad. Between these two extremes, we are interested in systems that use heuristics but that behave the way real systems might.

We simulated a “real” system that uses a greedy optimization approach as follows: starting with a document-subtopic matrix, we degraded it by changing each 1 (indicating that document i contains subtopic j) to a probability \(p_{ij}\) drawn from a Beta prior with parameters \(\alpha_p,\beta_p\), and each 0 (indicating that document i does not contain subtopic j) to a probability \(q_{ij}\) drawn from a Beta prior with parameters \(\alpha_q,\beta_q\). We then applied a greedy algorithm similar to Agrawal et al.’s (2009) IA-Select, which attempts to rank the documents that are most likely to satisfy previously-unsatisfied subtopics. The resulting ranked list is evaluated using S-recall.
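A sketch of the simulation (our code and naming; the ranking step is similar in spirit to IA-Select, scoring documents by their expected coverage of still-unsatisfied subtopics, but it is not a faithful reimplementation of that algorithm):

```python
import numpy as np

def degrade(T, alpha_p, beta_p, alpha_q, beta_q, rng=None):
    """Replace each 1 in the document-subtopic matrix with a draw from
    Beta(alpha_p, beta_p) and each 0 with a draw from Beta(alpha_q, beta_q)."""
    rng = rng or np.random.default_rng()
    return np.where(T == 1,
                    rng.beta(alpha_p, beta_p, size=T.shape),
                    rng.beta(alpha_q, beta_q, size=T.shape))

def greedy_rank(P):
    """Greedily rank documents by expected number of still-unsatisfied
    subtopics covered, then discount each subtopic by the probability
    that the chosen document satisfied it."""
    residual = np.ones(P.shape[1])        # P(subtopic still unsatisfied)
    remaining = list(range(P.shape[0]))
    ranking = []
    while remaining:
        best = max(remaining, key=lambda d: P[d] @ residual)
        ranking.append(best)
        remaining.remove(best)
        residual *= 1 - P[best]
    return ranking
```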

The Beta distribution parameters \(\alpha_p,\beta_p,\alpha_q,\beta_q\) offer some control over the expected quality of the simulated system:

  • As \(\alpha_p/(\alpha_p+\beta_p) \rightarrow 1\) and \(\alpha_q/(\alpha_q+\beta_q) \rightarrow 0\), the system approaches the best possible performance.

  • As \(\alpha_p/(\alpha_p+\beta_p) \rightarrow 0\) and \(\alpha_q/(\alpha_q+\beta_q) \rightarrow 1\), the system approaches the worst possible performance.

  • When \(\alpha_p/(\alpha_p+\beta_p) = \alpha_q/(\alpha_q+\beta_q)\), the system is ranking documents randomly.

To keep the parameter space manageable, we used \(\alpha_p = \beta_q\) and \(\alpha_q = \beta_p\), increasing \(\alpha_p\) and \(\alpha_q\) exponentially from \(2^0\) to \(2^7\). For large \(\alpha_p\) and small \(\alpha_q\), the system is better; for small \(\alpha_p\) and large \(\alpha_q\), the system is worse. At \(\alpha_p = \alpha_q\) the performance is random.
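Putting the pieces together, one cell of the parameter grid might look like the following sketch (our code, reusing the functions defined in the earlier sketches; the trial count and the \(\alpha_p = \beta_q\), \(\alpha_q = \beta_p\) coupling follow the text, everything else is illustrative):

```python
def run_cell(T, alpha_p, alpha_q, trials=100, rng=None):
    """Average greedy-vs-optimal S-recall approximation ratio for one
    (alpha_p, alpha_q) setting, with alpha_p = beta_q and alpha_q = beta_p."""
    rng = rng or np.random.default_rng()
    doc_sets = [set(np.flatnonzero(row)) for row in T]
    subtopics = set(range(T.shape[1]))
    greedy_m = greedy_min_rank(doc_sets, subtopics)    # greedy minRank
    optimal_m = true_min_rank(doc_sets, subtopics)     # exact minRank
    ratios = []
    for _ in range(trials):
        P = degrade(T, alpha_p, alpha_q, alpha_q, alpha_p, rng)
        ranking = [doc_sets[d] for d in greedy_rank(P)]
        greedy_eval = s_recall(ranking, subtopics, greedy_m)
        optimal_eval = s_recall(ranking, subtopics, optimal_m)
        if optimal_eval > 0:              # guard against degenerate trials
            ratios.append(greedy_eval / optimal_eval)
    return sum(ratios) / len(ratios) if ratios else float("nan")
```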

3.3.1 Results

We selected topics for which the greedy algorithm was suboptimal on either the burned-in matrix or the original matrix. We then degraded the matrix randomly and greedily re-ranked the documents according to the procedure above. Finally we calculated S-recall both greedily and optimally.

Figure 5 compares the mean performance measured by the greedy evaluation to the S-recall approximation ratio for topics 5 and 7, starting from their burned-in matrices. Each point is the result of averaging over 100 trials with a particular \(\alpha_p,\alpha_q\). Note that as simulated system performance degrades, we actually overestimate its performance more! This is quite disturbing, as it means that when the greedy evaluation is suboptimal, it will overestimate a bad system’s performance more than a good system’s performance. Bad systems will always appear better than they really are by a greater factor than good systems will.

Fig. 5 Comparison of greedy S-recall to S-recall approximation ratio for topic 5 (left) and topic 7 (right) starting from burned-in matrices. Each point represents a different pair of prior parameters \((\alpha_p,\alpha_q)\) and is averaged over 100 random trials

The degree of overestimation is worse for topic 7 than for topic 5. This is because the optimal minimum rank for topic 7 is 3 (greedy is 4), while the optimal minimum rank for topic 5 is 16 (greedy is 18). With a deeper rank required for evaluation, the system has less opportunity to “catch up” after passing the optimal rank. However, topic 5 has five outlying points with very high approximation ratios. These are all points where \(\alpha_q\) is substantially higher than \(\alpha_p\), meaning the system is a priori poor.

Figure 6 shows similar results starting from the original matrices for topics 18 and 30. Like topic 7, topic 18 has low optimal ranks (optimal 4 vs. greedy 5). Like topic 5, topic 30 has high optimal ranks (optimal 52 vs. greedy 53).

Fig. 6 Comparison of greedy S-recall to S-recall approximation ratio for topic 18 (left) and topic 30 (right) starting from original matrices. Each point represents a different pair of prior parameters \((\alpha_p,\alpha_q)\) and is averaged over 100 random trials

4 Discussion and conclusion

We have argued that NP-complete evaluation and optimization can be a serious problem for retrieval systems. Even if the approximation ratio is constant, we can significantly overestimate the performance of a system. In particular, the worse a system is, the more likely its performance is to be overestimated. These errors are not random errors that can be averaged out by sampling more topics; they are systemic problems with evaluation and optimization in this setting.

Furthermore, there will always be some topics for which it is theoretically impossible to optimize measures at every rank. As Fig. 1 and Table 1 illustrate, and as the NP-completeness of the computation implies, for some topics the optimal set of documents of size k is not a proper subset of the optimal set of size k + 1. This poses a problem for a system that is expected to rank things; it must choose just one of those ranks at which to try to optimize, because there is no consistent way to optimize for both. It is also impossible to optimize all measures simultaneously; there are firm theoretical tradeoffs in choosing to optimize for S-recall versus α-nDCG. The implication is that novelty and diversity systems must have a very clear idea of the user's task in order to provide the best possible experience.

There are other concerns about these measures as well. For one, it is possible to “game” them in a way that is not possible with traditional document-level relevance-based measures: a dishonest researcher or developer can simply introduce a new document that is a concatenation of the entire corpus. This new document will contain every subtopic and therefore will provide the maximum value for any of these measures, though it is clearly not useful to a user. For another, the cognitive load on assessors is much higher, as they must judge each document with respect to each subtopic. This introduces much more variance in the judgments than is present in standard relevance judgments.

Assuming the measure is well-chosen for the task, for most topics there is no problem. The greedy algorithm is optimal in 93% of the cases in “real” data, and in about 85% of cases in simulated data. The problem cases are those for which the greedy algorithm is not optimal, in particular those for which a bad system is significantly overrated by the greedy algorithm. Future work may investigate characterizing the problematic topics so that results may be adjusted appropriately, though, given the additional problems described above, more fruitful work may lie in investigating alternative representations of interdependent document relevance.