1 Introduction

The Cranfield experiments [12, 13] were conducted on a collection of 1,400 documents with complete relevance judgments for 225 topics. As collection sizes grew substantially, complete judgments became infeasible almost immediately thereafter. The current best practice at shared tasks in IR is to create per-topic pools of the submitted systems’ top-ranked documents and then judge each topic’s pool [40]. Systems that did not contribute to the pools may later retrieve unjudged documents. Thakur et al. [36] recently observed this for TREC-COVID [41], where dense retrieval models in post-hoc experiments retrieved many unjudged documents that turned out to be relevant. Typical reasons for “incomplete” judgments are a lack of run diversity or time constraints, the latter being the case for TREC-COVID according to Roberts et al. [29]. When reusing shared task data, one thus often has to deal with unjudged documents.

Fig. 1. Actual (obtained via post-judging) and estimated nDCG@10 of the dense retrieval model ANCE for selected TREC-COVID topics with unjudged documents.

Unjudged documents can be judged post hoc, but this can be costly and inconsistent with the original judging process. Typically, post-hoc evaluations either remove unjudged documents (condensing a new system’s result lists to the judged documents in their relative order) [31], or assume all unjudged documents to be either non-relevant or highly relevant (naïve lower/upper bounds) [25]. Both ideas have drawbacks: Condensed lists often overestimate effectiveness [33], and the difference between naïve lower and upper bounds can be very large [25], especially for a recall-oriented measure such as nDCG [23], one of the most widely reported measures across retrieval tasks [11, 15, 17, 36]. We further show that lower/upper bounds on nDCG are potentially incomparable to results reported based on complete judgments on the same data (Sect. 3.3).

To address the outlined problems, we propose a new bootstrapping approach to estimate nDCG in the presence of unjudged documents (Sect. 3). By repeatedly sampling judgments for unjudged documents using run- and/or pool-based priors, we derive a distribution of possible nDCG scores for a retrieval system on a topic. Figure 1 compares such distributions with the estimates of condensed lists and the naïve lower/upper bounds on selected TREC-COVID topics for the dense retrieval model ANCE [43], which retrieved many unjudged documents later deemed relevant [36]. The distributions help to identify topics with an extremely unlikely naïve upper bound (Topics 3, 19, 34), or where only a few nDCG scores between the bounds are very likely (Topic 22). In an evaluation on the Robust04, ClueWeb09, ClueWeb12, and TREC-COVID collections with real and simulated unjudged documents, we show the mode of the bootstrapped nDCG score distribution to be a more accurate estimate than those obtained from condensed lists or the often-default naïve lower bound (Sect. 4). Moreover, bootstrapped nDCG bounds can be configured to be much tighter than the naïve upper bound at a negligible loss of accuracy. For future nDCG evaluations with unjudged documents, we share our data and code, which are compatible with TrecTools [28].

2 Background and Related Work

We briefly review the nDCG evaluation measure, methods for dealing with unjudged documents, and previous applications of bootstrapping in IR.

Normalized Discounted Cumulative Gain (nDCG). The nDCG [23] is one of the most widely used IR evaluation measures (e.g., in the TREC Web and Deep Learning tracks [8, 17] or in the BEIR benchmark [36]). It is a normalized version of the discounted cumulative gain (DCG) that combines result ranks and graded relevance so that lower-ranked results contribute less “gain”. The DCG is usually defined as

$$ \textrm{DCG}@k = \sum_{i=1}^k \frac{2^{rel(d_i,q)} - 1}{\log_2(1+i)}\,, $$

where k is the maximum rank to consider, \(rel(d_i,q)\) is the graded relevance judgment of the document returned at rank i for the query q, the logarithm ensures a smooth reduction of the gain at lower ranks, and \(2^{rel(d_i,q)}\) emphasizes highly relevant documents. The nDCG@k normalizes a system’s DCG@k score by dividing by the DCG\(^*@k\) score of the “ideal” top-k ranking of the pool (i.e., the ranking of the judged documents by decreasing relevance). Note that the ideal ranking may easily include documents that some systems do not return in their results.
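For concreteness, here is a minimal Python sketch of DCG@k and nDCG@k as defined above; the function names and toy grades are ours and purely illustrative:

```python
import math

def dcg_at_k(grades, k):
    """DCG@k for graded relevance labels in ranked order."""
    return sum((2 ** g - 1) / math.log2(1 + i)
               for i, g in enumerate(grades[:k], start=1))

def ndcg_at_k(run_grades, pool_grades, k):
    """nDCG@k: normalize by the ideal top-k ranking of the judged pool."""
    ideal_dcg = dcg_at_k(sorted(pool_grades, reverse=True), k)
    return dcg_at_k(run_grades, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: a run's top-5 grades vs. a pool with three relevant documents.
print(ndcg_at_k([2, 0, 1, 0, 0], [2, 1, 1, 0, 0, 0], k=5))  # ~0.85
```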

Methods to Deal with Unjudged Documents. Only a few “specialized” retrieval effectiveness measures specifically target situations with unjudged documents (e.g., bpref [4] or RBP [27]). Yet, these measures are used in only a few scenarios, such as the TREC 2009 Web track [8] that aimed for minimal judgment pools [6]. Most retrieval studies instead report measures that assume all documents in the evaluated part of a ranking to have relevance judgments (e.g., nDCG). When evaluating a new retrieval system in the scenario of such a study, retrieved documents that were not in the original judgment pool cause problems [4, 46].

Typical methods [25] to deal with unjudged documents are: (1) assuming non-relevance, (2) predicting relevance, (3) condensing result lists, or (4) computing naïve bounds. Assuming non-relevance for unjudged documents is the standard in trec_eval, but only yields good results for “essentially” complete judgments [42] and favors systems that retrieve many (relevant) judged documents [35]. Since systems that retrieve unjudged but relevant documents might be severely underestimated [46], there have been attempts to automatically predict relevance [1, 2, 5, 7] (e.g., based on document content). However, such predictions can be problematic given that even experienced human assessors can struggle with such judgments [38]. Inferred measures like infAP [44] and infNDCG [45] can also be viewed as prediction approaches. They exploit the probabilities with which documents were sampled for incomplete judgment pools with reduced overall effort [39]. But inference does not work well for post-hoc evaluation of systems that did not contribute to the original pool sampling, since the sampling probabilities for newly retrieved high-ranked documents can then be undefined. Still, the general idea of sampling inspired our approach.

In the condensed list approach, all unjudged documents are removed from a ranked list before calculating effectiveness. The conceptual simplicity and the experimental evidence [31] that condensed lists give better results than the specially designed bpref helped condensed lists become widely used, for example in TrecTools [28] and PyTerrier [26]. But like relevance prediction, condensed lists have the disadvantage of hiding the uncertainty created by unjudged documents. This motivates approaches that make this uncertainty “visible,” such as calculating (naïve) lower or upper effectiveness bounds [25, 27].
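List condensation itself is simple to express; a minimal sketch (the helper below is illustrative and not the TrecTools or PyTerrier API):

```python
def condense(ranked_doc_ids, judgments):
    """Keep only judged documents, preserving their relative order."""
    return [d for d in ranked_doc_ids if d in judgments]

# judgments maps document ids to relevance grades for one topic.
judgments = {"d1": 2, "d3": 0, "d7": 1}
print(condense(["d1", "d9", "d3", "d4", "d7"], judgments))  # ['d1', 'd3', 'd7']
```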

Naïve bounds contrast the worst case with the best case by calculating the score a system would achieve if all unjudged documents were non-relevant or highly relevant, respectively. Distinguishing utility-based measures (based only on the ranking itself) from recall-based measures (normalized by a “best possible” ranking), the naïve bounds are designed for the former [25]. For utility-based measures, any actual effectiveness score of a system is guaranteed to lie within the naïve bounds. However, for recall-oriented measures like nDCG, we show that the actual effectiveness of a system may lie outside the naïve bounds (cf. Sect. 3.3) and that expanding them to guarantee correctness often leads to meaningless bounds of 0.0 (lower) and 1.0 (upper).

Our new bootstrapping approach addresses the outlined shortcomings of the existing ideas for dealing with unjudged documents when using nDCG. By deriving a distribution of possible nDCG scores, we allow tighter bounds and more informed point estimates. Both improvements are based on the same underlying distribution of possible nDCG values, which also simplifies uncertainty assessment and interpretation.

Bootstrapping in Information Retrieval. Bootstrapping is a statistical technique in which repeated samples are drawn from data to obtain a distribution for subsequent statistical analyses [18]. It has been applied to various statistical problems in information retrieval, either as topic bootstrapping or corpus bootstrapping. Topic bootstrapping was probably the first use of bootstrapping in IR [34]. It refers to the repeated sampling of queries for some statistical analyses and has been used in significance tests [34, 35] or to assess the discriminatory power of effectiveness measures [30, 32, 47]. However, topic bootstrapping is not intended to assess the uncertainty created by unjudged documents.

In corpus bootstrapping, documents are sampled from a corpus to simulate different corpora [47]. Previous use cases of corpus bootstrapping include assessing the transferability of system comparisons between different corpora [16] or the robustness of evaluation measures [47] and significance tests [19]. The assumption underlying corpus bootstrapping is that observations should be stable between (slightly) different corpora. This inspired our idea of applying bootstrapping to evaluations with unjudged documents, in the sense that an unjudged document should “behave” similarly to the judged documents in a run and/or pool. Bootstrapping has not yet been applied to evaluations with unjudged documents, although the research reviewed above shows that bootstrapping enables similar applications. By making our code publicly available, we also support Sakai’s call for bootstrapping to receive more attention in IR [30].

3 Bootstrapping nDCG Scores

After preparatory theoretical considerations, we propose a bootstrapping approach to generate nDCG score distributions by repeatedly sampling judgments for unjudged documents. Based on the lessons learned, we then reconsider current methods for estimating lower and upper bounds and propose improvements.

3.1 Preparatory Theoretical Considerations

As briefly discussed in Sect. 2, nDCG requires judgments to be complete up to the desired scoring depth k. Unjudged documents in the top-k results of a system must therefore either be judged post hoc, or their relevance must be estimated by some strategy. Post-hoc judging is costly and may be inconsistent with the original judgments, which often leaves automatically estimating the relevance of unjudged documents as the most feasible practical option.

A first idea could be to simply randomly sample relevance labels for unjudged documents. But without any further corrections, this approach can lead to invalid results. For instance, consider an evaluation setting with three relevance grades \(\{0;1;2\}\) and a fictional judgment pool that contains nine highly relevant documents (grade 2), one relevant document (grade 1), and arbitrarily many non-relevant documents (grade 0) for some topic. Assume that a to-be-evaluated system A returns in its top-10 results the nine highly relevant documents from the pool and one unjudged document not part of the pool. Suppose that relevance grade 2 is randomly sampled for the unjudged document. Adding this sampled highly relevant document to the pool then improves the ideal ranking:

$$ \textrm{DCG}^*_{\text{original pool}}@10 \;<\; \textrm{DCG}^*_{\text{pool with sample}}@10\,. $$

If \(\textrm{DCG}^*_{\text{pool with sample}}@10\) is used as the normalization denominator for computing the nDCG@10 of system A, the resulting scores are not directly comparable to nDCG scores of other systems calculated based on complete judgments for the original pool and \(\textrm{DCG}^*_{\text{original pool}}@10\). Comparability could be reestablished by recalculating the nDCG scores of the other systems using \(\textrm{DCG}^*_{\text{pool with sample}}@10\). Yet, recalculating scores might be biased towards the newly added system: if the randomly sampled grade is higher than the unjudged document’s true relevance, the recomputation pushes the original systems’ nDCG scores below their true values while inflating the newly added system’s nDCG beyond its true value.

Conversely, keeping \(\textrm{DCG}^*_{\text{original pool}}@10\) as the denominator to maintain comparability is not valid either. In the example case of system A, this would cause

$$ \textrm{DCG}_{\text{system}\,A}@10 \;>\; \textrm{DCG}^*_{\text{original pool}}@10 \quad\leadsto\quad \frac{\textrm{DCG}_{\text{system}\,A}@10}{\textrm{DCG}^*_{\text{original pool}}@10} \;>\; 1\,, $$

which exceeds the [0, 1] range that the normalization is meant to guarantee.

It follows that theoretically sound and empirically viable nDCG estimation approaches to handle unjudged documents must not change the pool’s initial number of judgments per relevance grade in order to preserve the DCG\(^*@k\).
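The example can be checked numerically. A small sketch under the assumptions stated above (nine grade-2 pool documents, one grade-1 pool document, and the unjudged document sampled as grade 2; since all ten of system A’s grades are then equal, the unjudged document’s rank does not affect the outcome):

```python
import math

def dcg(grades):
    return sum((2 ** g - 1) / math.log2(1 + i) for i, g in enumerate(grades, 1))

system_a = [2] * 10                 # nine judged grade-2 documents plus the
                                    # unjudged document sampled as grade 2
ideal_original = [2] * 9 + [1]      # ideal top-10 of the original pool
ideal_with_sample = [2] * 10        # ideal top-10 after adding the sample

print(dcg(system_a) / dcg(ideal_original))     # > 1.0: invalid as nDCG
print(dcg(system_a) / dcg(ideal_with_sample))  # = 1.0, but incomparable to
                                               # scores on the original pool
```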

Algorithm 1.

3.2 Our Bootstrapped nDCG Estimation Approach

Algorithm 1 shows our approach. It meets the constraint of preserving the pool’s number of judgments per relevance grade by guiding the random sampling of relevance grades with a \( prior \). In each of the b bootstrap iterations, a relevance grade r is sampled for each unjudged document in the top-k ranking R according to one of three sampling \( prior \)s derived from the judgment pool J and/or R:

$$ \begin{array}{lll} \text{pool-based} & P( rel = r \mid J) &= \dfrac{|\{d \in J \,:\, rel(d,q) = r\}|}{|J|}\,, \\[1.5ex] \text{run-based} & P( rel = r \mid R) &= \dfrac{|\{d \in R \,:\, rel(d,q) = r\}|}{|\{d \in R \,:\, d \text{ is judged}\}|}\,,\ \text{and} \\[1.5ex] \text{pool}{\texttt{+}}\text{run-based} & P( rel = r \mid J,R) &= \dfrac{P( rel = r \mid J) + P( rel = r \mid R)}{2}\,. \end{array} $$

During sampling, our approach checks in each iteration whether the desired relevance grade r is still present in the pool. If not, the highest possible judgment that is below the desired grade is selected, with grade 0 as the default fallback option. This sampling strategy guarantees that the ideal ranking of the original pool J and the ideal ranking of the final “sampled” judgments \(J' \cup S'\) have the same DCG\(^*@k\). The bootstrapped nDCG scores for R are thus directly comparable to nDCG scores of other rankings derived from the same pool J (e.g., to completely judged runs with nDCG scores computed on the initial pool).
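To make the procedure concrete, the following minimal Python sketch implements the pool+run-based variant. It reflects our reading of Algorithm 1, in particular of the availability check that preserves the pool’s grade counts, and is not the released implementation; all names are illustrative:

```python
import math
import random
from collections import Counter

def dcg(grades):
    return sum((2 ** g - 1) / math.log2(1 + i) for i, g in enumerate(grades, 1))

def pool_run_prior(pool_grades, run_grades):
    """Average of the pool-based and run-based grade distributions."""
    grades = sorted(set(pool_grades) | set(run_grades))
    pool_c, run_c = Counter(pool_grades), Counter(run_grades)
    p_pool = {r: pool_c[r] / len(pool_grades) for r in grades}
    p_run = ({r: run_c[r] / len(run_grades) for r in grades}
             if run_grades else p_pool)  # fall back to the pool prior
    return {r: 0.5 * (p_pool[r] + p_run[r]) for r in grades}

def bootstrap_ndcg(run, qrels, k=10, b=1000, seed=0):
    """run: ranked doc ids; qrels: doc id -> grade for one topic."""
    rng = random.Random(seed)
    ideal_dcg = dcg(sorted(qrels.values(), reverse=True)[:k])
    judged = {d: qrels[d] for d in run[:k] if d in qrels}
    prior = pool_run_prior(list(qrels.values()), list(judged.values()))
    grades, weights = list(prior), list(prior.values())
    scores = []
    for _ in range(b):
        # Pool grades still available for sampling: a sampled grade "consumes"
        # one pool document of that grade so that the ideal ranking (and thus
        # DCG*@k) stays unchanged.
        available = Counter(qrels.values()) - Counter(judged.values())
        topk = []
        for d in run[:k]:
            if d in judged:
                topk.append(judged[d])
                continue
            r = rng.choices(grades, weights=weights)[0]
            while r > 0 and available[r] <= 0:
                r -= 1            # highest available grade below r, else 0
            available[r] -= 1
            topk.append(r)
        scores.append(dcg(topk) / ideal_dcg if ideal_dcg > 0 else 0.0)
    return scores                 # distribution of plausible nDCG@k scores
```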

Efficient Implementation. Our bootstrapping approach computes nDCG scores in each iteration. To ensure efficiency, we precompute and tabulate the possible discounted gain values for each relevance grade at each of the top-k ranks, the DCG\(^*@k\) of the ideal ranking of the given pool J, and the sum of the discounted gain values of the judged documents in R; none of these values change during bootstrapping. The nDCG score computation can then look up the sampled discounted gain values for unjudged documents, add them to the precomputed intermediate DCG of the judged part of R, and divide by the precomputed DCG\(^*@k\) of J. On an AMD Epyc 1.8 GHz CPU, a TrecTools-based tabulated implementation of our approach takes an average of 2.84 s per topic (stddev: 0.01 s) to bootstrap nDCG@10 scores for the four runs that have the most unjudged documents in TREC-COVID (9–32% unjudged documents) as per Thakur et al. [36]; without tabulation, it takes 17.62 s (stddev: 0.91 s). The fast run time shows that bootstrapping is practically applicable, especially since massive further parallelization is possible.
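A sketch of this tabulation (an illustrative decomposition, not the released TrecTools-based code):

```python
import math

def precompute(run, qrels, k=10):
    """Tabulate everything that does not change across bootstrap iterations."""
    discount = [1.0 / math.log2(1 + i) for i in range(1, k + 1)]
    gain = {r: [(2 ** r - 1) * d for d in discount]  # gain of grade r per rank
            for r in set(qrels.values()) | {0}}
    ideal_dcg = sum((2 ** g - 1) * d for g, d in
                    zip(sorted(qrels.values(), reverse=True), discount))
    judged_dcg = sum(gain[qrels[d]][i]
                     for i, d in enumerate(run[:k]) if d in qrels)
    unjudged_ranks = [i for i, d in enumerate(run[:k]) if d not in qrels]
    return gain, ideal_dcg, judged_dcg, unjudged_ranks

def ndcg_from_sample(sampled_grades, gain, ideal_dcg, judged_dcg, unjudged_ranks):
    """Per-iteration work: table lookups, additions, and one division."""
    total = judged_dcg + sum(gain[r][i]
                             for r, i in zip(sampled_grades, unjudged_ranks))
    return total / ideal_dcg if ideal_dcg > 0 else 0.0
```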

3.3 Conceptual Comparison

Our preparatory considerations from Sect. 3.1 also apply to the derivation of lower/upper bounds for nDCG. Bounds for nDCG inspired by RBP [25, 27] can be incomparable, too. Naïve bounds can easily be made comparable, but we show that neither they nor RBP-inspired bounds are guaranteed to be correct. We thus devise guaranteed bounds, but show that they are then “necessarily” very broad.

Table 1. Examples with incorrect RBP-inspired/naïve nDCG@2 bounds or with very broad guaranteed nDCG bounds; relevance labels from 0 (not rel.) to 3 (highly rel.).

Error Bounds for nDCG. Inspired by the error bounds proposed for the utility-based measure RBP [25, 27], lower/upper bounds for nDCG may be derived by assigning either a relevance grade of 0 or the highest relevance grade to all unjudged documents. But since the latter changes the ideal ranking, such an upper bound can lead to incomparable nDCG scores. Therefore, in order to yield comparable scores, we propose that an RBP-inspired “naïve” upper bound for nDCG should greedily assign the highest still-available relevance grade from the pool to the highest-ranked unjudged document, iterating down the ranking. If the pool’s available non-zero grades are exhausted, 0 is assigned. This naïve bounding does not change the DCG\(^{*}@k\) and thus yields scores comparable to other rankings on the pool. However, the examples in Table 1 show that both the RBP-inspired and the naïve bounds can be incorrect. The RBP-inspired lower bound (and thus also the equivalent naïve lower bound) can be too high (first row; the actual grade of 2 for the unjudged document increases DCG\(^{*}@k\) more than DCG@k). Similarly, the RBP-inspired and naïve upper bounds can be incorrect (first and second rows). For a guaranteed correct lower bound, a hypothetical ideal ranking must be assumed that consists only of documents with the highest relevance grade, while all unjudged documents receive a grade of 0. Computing a guaranteed correct upper bound is more complicated but ultimately also uses a different ideal ranking, which makes the guaranteed bounds incomparable.
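A sketch of the comparability-preserving naïve upper bound described above; our reading is that pool grades already used by judged documents in the ranking are not reassigned, and all names are illustrative:

```python
from collections import Counter

def naive_upper_bound_grades(run, qrels, k=10):
    """Greedily give each unjudged document (top rank first) the highest
    relevance grade still available in the pool; 0 once exhausted."""
    remaining = (Counter(qrels.values())
                 - Counter(qrels[d] for d in run[:k] if d in qrels))
    grades = []
    for d in run[:k]:
        if d in qrels:
            grades.append(qrels[d])
        else:
            best = max((g for g, c in remaining.items() if g > 0 and c > 0),
                       default=0)
            remaining[best] -= 1
            grades.append(best)
    return grades  # plug into DCG@k and divide by the unchanged DCG*@k
```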

Table 2. Characteristics of methods to deal with unjudged documents in nDCG scoring. Some are deterministic, some not (Det.), and they use different strategies with pool- and/or run-based priors. All are “comparable” (i.e., do not change the ideal DCG\(^*@k\)).

Discussion. Table 2 summarizes characteristics of methods that deal with unjudged documents while preserving the ideal ranking. The methods rely on different priors (none, pool-, run-, or pool+run-based), some only implicitly, like the upper bound method, which uses the pool’s highest remaining judgments. Our bootstrapping idea incorporates priors from both run and pool, and indicates the uncertainty introduced by unjudged documents through a probability distribution. Condensed lists and naïve bounds only generate point scores.

4 Evaluation

We experimentally compare our bootstrapping approach to naïve bounds and condensed lists in real and simulated scenarios with unjudged documents on the Robust04, ClueWeb09, ClueWeb12, and TREC-COVID collections. In the comparison, we assess the ability to predict actual nDCG scores, the effects on system rankings, and the tightness of potential bounds. For score prediction and the resulting system rankings, our approach uses the most likely nDCG score from the bootstrapped distribution; for tighter bounds, it uses fixed percentiles of the bootstrapped distribution. All experiments use nDCG@10, since it is predominant in shared tasks and is the highest cut-off for which the four collections have complete judgments for the submitted runs.

4.1 Experimental Setup

We compare a run with unjudged documents in two setups: (1) against its fully judged counterpart (measuring the accuracy of lower and upper bounds), and (2) against other runs without unjudged documents (measuring correlations in system rankings). Score ties in a run are broken via alphanumeric ordering by document ID (following a recommendation by Lin and Yang [24]). To reduce the impact of low-performing systems, only the 75% of runs with the highest nDCG@10 are included (following a similar setup by Bernstein and Zobel [3]). The ClueWeb corpora contain a high number of near-duplicates [20] that might invalidate subsequent evaluations [3, 21, 22]; we use pre-calculated lists [20] to deduplicate the run and qrel files. Following trec_eval, we replace negative relevance judgments with 0. All experiments use TrecTools’ nDCG@10 implementation with default parameters, and we report statistical significance, where applicable, according to Student’s t-test with Bonferroni correction at \(p=0.05\).

Test Collections. Our evaluation is based on four collections: (1) Robust04 [37] (528,155 documents, 249 topics, 311,410 relevance judgments, pool: 111 runs by 14 groups), (2) ClueWeb09 (1 billion web pages, 200 topics, 58,414 judgments from TREC Web tracks [8,9,10,11], pools: 32–71 runs by 12–23 groups), (3) ClueWeb12 (0.7 billion web pages, 100 topics, 23,233 judgments from TREC Web tracks [14, 15], pools: 34 + 30 runs by 14 + 12 groups), (4) TREC-COVID [41] (171,332 documents, 50 topics, 66,336 judgments).

Establishing Incompleteness. TREC-COVID allows a real case study on incompleteness: in post-hoc experiments, three dense retrieval models retrieved 17% to 41% unjudged documents in their top-10 results, which were subsequently post-judged [36]. For Robust04, ClueWeb09, and ClueWeb12, we simulate incomplete pools with the “leave one group out” method [38]: for each run, we remove from the pool all documents solely contributed by the group that submitted the run (i.e., only their runs have the document in the top-10 results), simulating that the group did not participate. This yields one incomplete pool per group, while the runs of all other groups remain fully judged.
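A minimal sketch of this pool reduction; the pool_contributors mapping is an assumed precomputed structure, not part of the original tooling:

```python
def leave_group_out_qrels(qrels, pool_contributors, held_out_group):
    """Drop judgments for documents that only the held-out group contributed
    to the pool, simulating that this group did not participate."""
    # pool_contributors: doc id -> set of groups whose runs put the document
    # into the judgment pool (via their top-10 results).
    return {doc: grade for doc, grade in qrels.items()
            if pool_contributors.get(doc, set()) != {held_out_group}}
```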

Table 3. The prevalence of each relevance label in the judgment pool and the unjudged documents, respectively. For Robust04, ClueWeb09, and ClueWeb12, we show the simulated incompleteness averaged over groups; TREC-COVID is real incompleteness.

Table 3 provides an overview of the ratios of relevance degrees in the pools and the unjudged documents. For simulated incompleteness, we report averages over all groups. None of the collections are complete, as all have relevant documents among the unjudged ones. However, for Robust04, the high number of submitted runs and deep pooling ensured that the pools are “essentially complete”, even for simulated incompleteness (4% of the unjudged documents are relevant). The remaining collections have 20% to 33% relevant documents among the unjudged ones, providing a good range of (in)completeness for our experiments.

4.2 Evaluation Results

For the nDCG prediction experiments, accuracy is reported as the root-mean-square error (RMSE), contrasted with two RMSE variants that assess lower and upper bounds. Furthermore, we measure the correlation of system rankings obtained from predicted nDCG scores with the ground-truth rankings as Kendall’s \(\tau \) and Spearman’s \(\rho \). For the experiments on tightening naïve bounds, we measure precision and recall in reconstructing per-topic system rankings. Evaluation is first conducted on simulated incompleteness and concludes with the TREC-COVID case study.

nDCG Score Prediction. Table 4 reports the nDCG@10 prediction accuracy of all tested approaches. We report the actual RMSE, a lower-bound RMSE (ignoring underestimations), and an upper-bound RMSE (ignoring overestimations). Cases with incorrect naïve bounds occur in practice but are rare: the naïve lower bound is slightly more inaccurate than the naïve upper bound (maximum violations of 0.009 for the lower bound on ClueWeb09 vs. 0.002 for the upper bound on Robust04). In line with the collections’ degrees of incompleteness (Table 3), the actual RMSE is rather small on Robust04, larger on ClueWeb09, and highest on ClueWeb12. Consequently, the naïve lower bound that assumes unjudged documents to be non-relevant has high accuracy on the first two collections, but is outperformed by condensed lists on ClueWeb12 (RMSE 0.113 vs. 0.092).
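The three RMSE variants can be sketched as follows; whether ignored cases contribute a zero error or are dropped from the denominator entirely is our assumption (here they contribute zero):

```python
import math

def bounded_rmse(predicted, actual, ignore="none"):
    """ignore='under' zeroes out underestimations (lower-bound RMSE),
    ignore='over' zeroes out overestimations (upper-bound RMSE)."""
    sq_errors = []
    for p, a in zip(predicted, actual):
        if (ignore == "under" and p < a) or (ignore == "over" and p > a):
            sq_errors.append(0.0)   # assumed: ignored cases count as no error
        else:
            sq_errors.append((p - a) ** 2)
    return math.sqrt(sum(sq_errors) / len(sq_errors))
```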

Table 4. Overview of nDCG score prediction assessed by the actual RMSE, and the lower and upper bound RMSE (ignoring under/overestimations) on Robust04 (R04), ClueWeb09 (CW09), and ClueWeb12 (CW12). We report statistical significance according to Student’s t-test with Bonferroni correction at p=0.05 to the naïve lower (\(\dagger \)) and upper bound (\(\ddagger \)), respectively condensed lists (\(*\)).

Our three bootstrapping variants with a prior from the pool (Bootstr.\(_{P}\)), the run (Bootstr.\(_{R}\)), or both (Bootstr.\(_{P{\texttt {+}}R}\)) show that priors from the run yield more accurate results than priors from the pool, and that combining both yields the highest accuracy in all cases, significantly improving upon the naïve lower and upper bounds and upon condensed lists. This result is reasonable, as the combination of run and pool priors allows the bootstrapping approach to account for relationships between the topic and the run. Bootstrapped nDCG scores from run and pool priors are thus highly applicable in practice, as they yield the most accurate nDCG predictions in all our experiments. Additionally, by comparing the lower- and upper-bound RMSE of condensed lists with those of pool/run-based bootstrapping, we observe that condensed lists tend to overestimate on all corpora. In contrast, bootstrapped predictions are more balanced, with a tendency to underestimate, which is preferable in practice [35].

Table 5. Overview of the correlation between system rankings obtained via predicted nDCG@10 scores on incompletely judged runs to those runs with complete judgments. We report Kendall’s \(\tau \) and Spearman’s \(\rho \) on Robust04, ClueWeb09, ClueWeb12, and the mean over those three corpora.

System Ranking Reconstruction Against Incompletely Judged Runs. We complement our experiments on the accuracy of predicted nDCG@10 scores by measuring the correlation of system rankings obtained via predicted scores on incompletely judged runs with the ground-truth system ranking obtained via fully judged runs. To this end, we predict the nDCG@10 scores of each run using the incomplete judgments for the run obtained via the “leave one group out” method [38]. Table 5 reports the correlation of the system rankings obtained on the incomplete judgments with the ground-truth system ranking, measured as Kendall’s \(\tau \) and Spearman’s \(\rho \). Again, we observe that the judgment pool for Robust04 is, even with simulated incompleteness, highly reusable, as all approaches (besides the naïve upper bound) achieve high correlations (pool/run-based bootstrapping having the highest Kendall’s \(\tau \) of 0.966). Our pool/run-based bootstrapping substantially outperforms condensed lists in all cases and also achieves the highest correlation on average over all three corpora (Kendall’s \(\tau \) of 0.832).

System Ranking Reconstruction Against Fully Judged Runs. To assess pool/run-based bootstrapping for tightening the naïve bounds, we compare different methods for score prediction w.r.t. their ability to reconstruct the topic-level ground-truth ranking of systems. Given a run with unjudged documents, we first calculate point estimates: the naïve lower bound, the condensed-list score, and the most likely score according to pool/run-based bootstrapping. Then, score ranges are established, starting at the naïve lower bound and ending at different upper ends: the naïve upper bound, the condensed-list score, and the 75th, 90th, or 95th percentile of the bootstrapped distribution. Score ranges and point estimates for each run are compared against the scores of all other runs that contributed to the respective pool, emitting a system preference whenever the range/estimate is strictly below or above the exact score of the other system.
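A sketch of the preference emission for one run on one topic (names illustrative; a point estimate is the special case low == high):

```python
def emit_preferences(low, high, other_scores):
    """Compare a score range [low, high] against other systems' exact
    nDCG@10 scores; abstain whenever the range contains the other score."""
    preferences = {}
    for system, score in other_scores.items():
        if high < score:
            preferences[system] = "below"   # ranked below the other system
        elif low > score:
            preferences[system] = "above"   # ranked above the other system
    return preferences

print(emit_preferences(0.42, 0.58, {"runA": 0.61, "runB": 0.35, "runC": 0.50}))
# {'runA': 'below', 'runB': 'above'}  (no preference emitted for runC)
```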

Table 6 reports the reconstruction effectiveness as precision, recall, and F1 score. In recall-oriented settings, where score ranges are unsuitable, the naïve lower bound (recall of 0.954 on Robust04) or the bootstrapped prediction (recall of 0.903 on ClueWeb12) should be used. In precision-oriented scenarios, naïve bounds achieve the highest precision at a high cost in recall (only 0.547 on ClueWeb12). Pool/run-based bootstrapping at the 95th percentile provides significantly tighter bounds than the naïve ones (recall is always significantly better) at a negligible loss in precision (not statistically significant in any case). Hence, nDCG bounds can be substantially tightened without a loss in accuracy using bootstrapping.

Table 6. Precision, recall, and F1 in reconstructing topic-level system rankings with unjudged documents. We report significance (Student’s t-test with Bonferroni correction at p=0.05) to the point estimate of list condensation (\(*\)) and score ranges starting at the lower bound, ending at the naïve upper bound (\(\dagger \)), resp. list condensation (\(\ddagger \)).

Real Incompleteness on TREC-COVID. As a final case study, we apply naïve bounds, condensed lists, and our pool/run-based bootstrapping to estimate the nDCG@10 of three dense retrieval models on the original TREC-COVID collection, for which the unjudged documents were post-judged [36]. The three dense retrieval systems operated in a zero-shot setting. Thus we compare them against the best run submitted to the first round of TREC-COVID, as those systems also had no access to training data.

Table 7 shows the results on the original (incomplete) TREC-COVID qrels and the post-hoc (complete) qrels for three selections of topics: (1) moderate incompleteness (between 25% and 50% unjudged documents), (2) high incompleteness (more than 50% unjudged documents), and (3) all topics (only the nDCG@10 scores in the setup with all topics are comparable between different systems). The original run files were not stored in the BEIR experiments [36], so we reproduced them (with only minor differences for ANCE, TAS-B, and ColBERT; for DPR, the scores were substantially different and the run still contained unjudged documents, so we exclude it). The default behaviour of assuming that unjudged documents are non-relevant (i.e., the naïve lower bound) underestimates the effectiveness of all dense retrieval models. At the same time, condensed lists substantially overestimate the effectiveness (e.g., by 0.150 for TAS-B). Our proposed pool/run-based bootstrapping produces the best estimates in all cases. Tightening upper bounds with bootstrapping is also very valuable, as the 95th percentile of the bootstrapped nDCG scores is much tighter than the naïve upper bound.

Table 7. The nDCG@10 on the original qrels (unjudged documents) from TREC-COVID and the expanded qrels (all documents judged) for topics with 25% to 50% unjudged documents (.25 to .5), topics with more than 50% unjudged documents (.5 to 1), and all topics. We report the proportion of unjudged documents (U@10), and predictions of the lower bound (Default), condensed lists (Cond.), pool/run-based bootstrapping (BS\(_{P{\texttt {+}}R}\)), and naïve and tightened upper bounds (BS\(_{P{\texttt {+}}R@95}\)).

5 Conclusion

Our new bootstrapping method to account for unjudged documents in post-hoc nDCG evaluations is efficient in practice and more effective than previous methods that derive a point estimate or bounds for a system’s true nDCG. Packaged as publicly available, TrecTools-compatible software, bootstrapped estimation is directly applicable in retrieval studies.

As interesting directions for future work, we want to expand our bootstrapping approach to more evaluation measures (e.g., Q-Measure, MAP, or RBP) and combine it with approaches that predict the relevance of unjudged documents based on their content. This combination could lead to more informed bootstrap priors and might also tighten the resulting bootstrapped score distributions.