1 Introduction

The evaluation of diversified web search results is a relatively new research topic and is not as well understood as the time-honoured evaluation methodology of traditional IR based on precision and recall. In diversity evaluation, one topic may have more than one intent, and systems are expected to balance relevance and diversity. This corresponds to a real web search situation where the query input by the user is either ambiguous (e.g. “office” may refer to a workplace or to a piece of software) or underspecified (e.g. “harry potter” may refer to a book, a film or the main character; Clarke et al. 2009), for which the web search engine attempts to produce the first search engine result page (SERP) that satisfies as many users or user intents as possible. In fact, one could argue that every query is underspecified to some degree, unless it is a perfect representation of the underlying information need or intent.

The TREC 2009 web diversity task (Clarke et al. 2010) was the first to evaluate web search result diversification and to construct a web diversity test collection within the context of an evaluation forum. The third round of this task concluded in November 2011. Meanwhile, the recent NTCIR-9 launched a new task called INTENT, which included a diversified web search subtask that differs from the TREC web diversity task in several respects (Song et al. 2011). The concluding workshop of NTCIR-9 was held in December 2011. The main differences between these two web diversity tasks are:

  • While TREC currently uses a version of Intent-Aware Expected Reciprocal Rank (ERR-IA) (Chapelle et al. 2011; Clarke et al. 2011a) as the primary evaluation metric, NTCIR uses the \(\hbox{D}\,\sharp\) evaluation framework (Sakai and Song 2011). \(\hbox{D}\,\sharp\) was chosen because a study based on a graded-relevance version of the TREC 2009 data (Sakai and Song 2011) showed that it offers several advantages over other diversity metrics such as α-nDCG (Clarke et al. 2009, 2011a). In particular, the study suggested that: (a) \(\hbox{D}\,\sharp\) measures are superior to Intent-Aware (IA) metrics in terms of discriminative power (Sakai 2006); and (b) \(\hbox{D}\,\sharp\)-nDCG rewards high-diversity systems more than α-nDCG and nDCG-IA do.

  • While TREC uses binary relevance assessments for each intent and assumes that each intent is equally popular in the evaluation, NTCIR uses per-intent graded relevance, as well as intent probabilities estimated through assessor voting. NTCIR decided to use per-intent graded relevance assessments because the degree of relevance (e.g. highly relevant vs. marginally relevant) to each intent was considered important in search quality evaluation, just as the degree of relevance to each query is known to be important in traditional web search evaluation. NTCIR also decided to take intent popularity into account in diversity evaluation, based on the rationale that URLs relevant to the popular intents should be prioritised in the ranking and that possibly more space in the first SERP should be allocated to the more popular intents than to the less popular ones.

  • While TREC uses 50 topics each year for evaluation, NTCIR used 100 topics for the Chinese task and another 100 topics for the Japanese task. This decision at NTCIR was based on the available budget, as well as previous studies which strongly suggest that we should have many topics rather than deep-pool assessments in order to obtain reliable experimental results (e.g. Carterette and Smucker 2007; Sakai and Mitamura 2010; Sanderson and Zobel 2005; Webber et al. 2008).

Table 1 Test collections and runs used in this study

The objective of this study is to examine whether these new features of NTCIR are useful for diversity evaluation, using the actual data recently obtained from the NTCIR-9 INTENT task.

Our main experimental findings are:

  1. The \(\hbox{D}\,\sharp\) evaluation framework used at NTCIR provides more “intuitive” and statistically reliable results than Intent-Aware Expected Reciprocal Rank. For measuring “intuitiveness,” we use the recently proposed concordance test (Sakai 2012), which examines how each diversity metric agrees with “basic” metrics such as precision and intent recall given a pair of ranked lists. For measuring reliability, we employ the aforementioned discriminative power, or the ability to detect significant differences while the confidence level is held constant, a method that has been used in a number of recent evaluation studies (e.g. Clarke et al. 2011a; Robertson et al. 2010; Sakai and Song 2011; Sakai 2012; Soboroff 2010; Webber et al. 2010).

  2. Utilising both intent popularity and per-intent graded relevance as is done at NTCIR tends to improve discriminative power, particularly for \(\hbox{D}\,\sharp\)-nDCG. Moreover, both intent popularity and per-intent graded relevance appear to individually contribute to the improvement, at least for \(\hbox{D}\,\sharp\)-nDCG.

  3. Reducing the topic set size, even by just 10 topics, can affect not only significance testing but also the entire system ranking; when 50 topics are used (as in TREC) instead of 100 (as in NTCIR), the system ranking can be substantially different from the original ranking and the discriminative power can be halved.

These results suggest that the directions being explored at NTCIR are valuable. While Finding 3 is not surprising in the context of traditional IR evaluation, to the best of our knowledge, we are the first to have studied the effect of topic set size for diversity evaluation.

The remainder of this paper is organised as follows. Section 2 discusses prior art related to the present study. Section 3 describes the NTCIR-9 INTENT data, as well as a graded-relevance version of the TREC 2009 web diversity data (Sakai and Song 2011) which we use in our additional experiments. The section also provides formal definitions of the evaluation metrics we discuss. Section 4 discusses the choice of diversity metrics by means of concordance tests. Section 5 discusses the effect of utilising intent popularity and per-intent graded relevance on discriminative power and on the entire system ranking, and Sect. 6 discusses the effect of reducing the topic set size. Finally, Sect. 7 provides our conclusions and discusses future work.

2 Related work

2.1 Comparing diversity evaluation metrics

The evaluation of diversified search results is a fairly new topic, although related IR tasks such as subtopic retrieval (Zhai et al. 2003) and faceted topic retrieval (Carterette and Chandar 2009) have been discussed earlier. Hence, there are only a small number of existing studies that “evaluate the evaluation methodology” for search result diversification. Exceptions include recent studies by Chapelle et al. (2011), Chandar and Carterette (2011), Clarke et al. (2011a), Sakai and Song (2011), Sakai (2012) and Sanderson et al. (2010), as discussed below.

Chapelle et al. (2011) showed that ERR-IA and α-nDCG are members of a family of metrics called Intent-Aware Cascade-Based Metrics (CBM-IAs). We shall briefly discuss their work in Sect. 3.

Chandar and Carterette (2011) compared α-nDCG, ERR-IA and an Intent-Aware version of average precision using the TREC 2009 web diversity track data, and isolated different effects such as diversity, relevance and ranking by means of Analysis of Variance. Their basic idea of separating out these different effects is similar in spirit to Sakai’s concordance test (Sakai 2012) which we shall discuss and utilise later in this paper.

Clarke et al. (2011a) compared different evaluation metrics using the TREC 2009 web diversity track data in terms of discriminative power and rank correlation (i.e. how closely a pair of metrics resemble each other in the way they rank systems). The metrics examined include ERR-IA, α-nDCG and subtopic recall (which we call intent recall or I-rec) amongst others. In their experiments, intent recall was the most discriminative among all the metrics that were examined. Some limitations of this study from the viewpoint of our objective are: (1) They used only one test collection, which lacked both the intent popularity and the per-intent graded relevance information, even though some of the metrics examined were capable of utilising such information; (2) They did not discuss which metrics are actually “good” (i.e. measuring what we want to measure) for diversity evaluation.

Sakai and Song (2011) added graded relevance assessments to the aforementioned TREC 2009 web diversity data, and compared different metrics in terms of discriminative power and rank correlation. They also examined the ranked lists manually to discuss the “intuitiveness” of diversity metrics. They proposed the \(\hbox{D}\,\sharp\) evaluation framework which plots an overall-relevance-oriented metric (e.g. D-nDCG) against the diversity-oriented intent recall, and showed that it can fully utilise intent probabilities and per-intent graded relevance. In particular, their results suggested that (a) \(\hbox{D}\,\sharp\) measures are superior to Intent-Aware (IA) metrics in terms of discriminative power; and (b) \(\hbox{D}\,\sharp\)-nDCG rewards high-diversity systems more than α-nDCG and nDCG-IA do. Some limitations of this study from the viewpoint of our objective are: (1) Again, they used only one test collection, although their version of the TREC 2009 data contained graded relevance assessments as well as artificial intent probabilities; (2) Their discussion of “intuitiveness” was only anecdotal and qualitative.

Sakai (2012) compared diversity metrics for an evaluation environment where each intent of a given topic is tagged with either informational or navigational. This applies to the TREC web diversity test collections but not to the NTCIR INTENT test collections. Using the aforementioned graded-relevance version of the TREC 2009 data, Sakai showed that intent recall, \(\hbox{D}\,\sharp\)-nDCG and its variant may be more discriminative than α-nDCG. More importantly, he showed that \(\hbox{D}\,\sharp\)-nDCG and its variant are more “intuitive” than α-nDCG according to concordance tests with intent recall and effective precision. However, Sakai’s experiments were also limited to the case of the TREC 2009 web diversity data.

Sanderson et al. (2010) used a part of the TREC 2009 web diversity data (with the original binary relevance assessments only) to study the predictive power of traditional and diversity metrics: if a metric prefers one ranked list over another, does the user also prefer the same list? Using Amazon Mechanical Turkers as surrogate users, Sanderson et al. reported that intent recall (called “cluster recall” in their paper) is as effective as more complex diversity metrics such as α-nDCG in predicting user preferences. While their study complements the aforementioned “user-free” studies of Clarke et al. (2011a) and Sakai and Song (2011), it should be noted that the Mechanical Turkers were given a subtopic rather than the entire topic when asked to judge which of a given pair of ranked lists is better: in fact, it is not straightforward to conduct a user study for diversity metrics, as they have been designed specifically to satisfy a population of users and their intents rather than a single user.

In the present study, we use the concordance test of Sakai (2012) to discuss the “intuitiveness” of metrics, using intent recall, precision and Precision for the Most Popular intent (PMP) as the gold-standard metrics. Intent recall (as a gold-standard metric) represents the ability of a metric to diversify; precision represents the ability to retrieve documents relevant to at least one intent of a topic; and PMP represents the ability to emphasise a popular intent in the SERP. The concordance test is similar to the predictive power of Sanderson et al. in that it involves comparisons of ranked list pairs, except that we replace the Turkers with the gold standard metrics, and that we can avoid treating each subtopic as an independent topic.

While all of the aforementioned studies concern the evaluation of a diversified ranked list of URLs, we note that this is not the only possible solution to presenting diversified search results to the user. For example, Brandt et al. (2011) propose a tree-like, dynamic presentation of diversified URLs, together with an evaluation method for that particular presentation method. However, even though search engine interfaces are becoming increasingly rich (e.g. in that they can use interactive presentation, multiple blocks and media etc.), a ranked list of items remains a simple and primary result presentation method today.

2.2 Diversity evaluation forums

The diversity task was introduced at the TREC 2009 web track (Clarke et al. 2010). Fifty topics were developed by sampling “torso” (i.e. medium popularity) queries from web search logs, and their subtopics were developed using a clustering algorithm so that they cover different intents that are “nearby in the space of user clicks and reformulations.” The diversity task had a total of 20 runs from 18 different groups. α-nDCG and Intent-Aware Precision were used for evaluating the runs. Also, the TREC blog track used the same diversity metrics for evaluating diversified blog post rankings (Macdonald et al. 2010).

At the TREC 2010 web track, a version of ERR-IA was chosen as the primary evaluation metric for the diversity task (Clarke et al. 2011b). Again, 50 topics with subtopics were developed, and the diversity task received 32 runs from 12 groups. The most recent TREC 2011 web diversity task was also very similar (Clarke et al. 2012): 50 topics were created (this time focussing on more “obscure” queries) and ERR-IA continued to be used as the primary evaluation metric. The task received 25 runs from nine groups.

The above three rounds of the TREC diversity task used the ClueWeb09 corpus. The task assumes that all intents are equally likely, and that per-intent relevance is binary.

The INTENT task, launched at NTCIR-9 held in 2011, consisted of two subtasks: Subtopic Mining and Document Ranking (Song et al. 2011). By mining torso queries from query logs, one hundred Chinese topics (for searching a Chinese web corpus called SogouT) and another one hundred Japanese topics (for searching the Japanese portion of ClueWeb09) were created. In Subtopic Mining, participating systems were required to submit a ranked list of subtopic strings in response to an input query. The submitted subtopic strings were pooled and manually clustered to form a set of intents for that query. Furthermore, ten assessors were hired to vote on the importance of each intent, and the votes were used to estimate the intent probabilities. The same topics, intents and intent probabilities were used to evaluate the Document Ranking runs (i.e. diversified search results) using intent recall, D-nDCG and a linear combination of the two metrics, namely \(\hbox{D}\,\sharp\)-nDCG. 24 runs from seven groups were submitted to Chinese Document Ranking; 15 runs from three groups were submitted to Japanese Document Ranking. For document relevance assessments, two assessors were hired for each topic, who provided per-intent graded relevance assessments on a three-point scale: nonrelevant, relevant and highly relevant. These labels were then consolidated across the two assessors to form a five-point relevance scale: from L0 (judged nonrelevant) to L4 (highest relevance).

Although they are not evaluation forums per se, the Diversity in Document Retrieval workshops also discuss diversity evaluation intensively.

3 Data and metrics

In the present study, we mainly use the NTCIR-9 INTENT Chinese and Japanese document ranking test collections and their runs (Song et al. 2011). Some statistics of the data are shown in Table 1a, b. As was mentioned in Sect. 1, these NTCIR test collections differ from the TREC diversity test collections in that they contain the intent popularity information, per-intent graded relevance assessments and topic sets that are twice as large. To examine the effects of these features, we shall create simplified versions of the NTCIR collections by removing the intent popularity information and graded relevance assessments, and also create reduced topic sets. Where applicable, we shall conduct some additional experiments using the graded-relevance version of the TREC 2009 web diversity task data which we obtained from Sakai and Song (2011): statistics are given in Table 1c. Note that we refer to the TREC subtopics as “intents” in this paper, to be consistent with the NTCIR terminology.

Another important difference between the current practices at the TREC web diversity task and those at NTCIR INTENT is that while the former uses a version of ERR-IA as the primary evaluation metric, the latter uses the \(\hbox{D}\,\sharp\) framework. Hence, the present study shall also discuss the choice of diversity metrics. First, let us formally define the evaluation metrics in question.

Intent recall (I-rec), also known as subtopic recall (Zhai et al. 2003) or cluster recall (Sanderson et al. 2010), is the number of different intents covered by a search output divided by the total number of possible intents. In this paper, as we are interested in diversifying the first SERP, we use a document cutoff of 10 for I-rec and for all other evaluation metrics.
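To make the definition concrete, the following is a minimal sketch of I-rec@10. The data layout (a dictionary mapping each document ID to its per-intent relevance levels) is an illustrative assumption, not the official NTCIREVAL input format.

```python
from typing import Dict, List

# Illustrative layout (not the official NTCIREVAL format):
# qrels[doc_id][intent_id] = per-intent relevance level (1-4 for L1-L4; 0 or absent = nonrelevant).
Qrels = Dict[str, Dict[str, int]]

def i_rec_at_k(ranked_docs: List[str], qrels: Qrels, n_intents: int, k: int = 10) -> float:
    """Intent recall: number of distinct intents covered in the top k, divided by all intents."""
    covered = {intent
               for doc in ranked_docs[:k]
               for intent, level in qrels.get(doc, {}).items()
               if level > 0}
    return len(covered) / n_intents
```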

Let \(Pr(i|q)\) denote the probability of intent i given query q, where \(\sum_{i} Pr(i|q) = 1\). Also, let \(g_{i}(r)\) denote the “local” gain value for the document at rank r with respect to intent i: throughout this study, we let the local gain value be 1, 2, 3 and 4 for per-intent relevance levels L1, L2, L3 and L4, respectively. In the \(\hbox{D}\,\sharp\) framework, the global gain for the document at rank r is defined as \(GG(r) = \sum_{i} Pr(i|q)\, g_{i}(r)\). Based on GG, we can easily define D-measures, such as D-nDCG:

$$ D{\hbox{-}}nDCG=\frac{\sum^{l}_{r=1} GG(r)/\log(r+1)}{\sum^{l}_{r=1} GG^{\ast}(r)/\log(r+1)} $$
(1)

where \(GG^{\ast}(r)\) is the global gain at rank r in a “globally ideal” ranked list, defined by sorting all documents in descending order of the global gain, and l is the document cutoff, which in our case is 10. Note that exactly one (globally) ideal list is defined for a given topic.
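As a worked companion to Eq. (1), here is a minimal sketch of GG(r) and D-nDCG@10 under the gain setting above (gain = per-intent relevance level, weighted by the intent probabilities). It reuses the illustrative qrels layout from the I-rec sketch; the logarithm base cancels in the numerator/denominator ratio, so base 2 is used for concreteness.

```python
import math
from typing import Dict

Qrels = Dict[str, Dict[str, int]]  # same illustrative layout as in the I-rec sketch

def global_gain(doc: str, qrels: Qrels, intent_prob: Dict[str, float]) -> float:
    """GG(r) = sum_i Pr(i|q) * g_i(r), where g_i(r) is the per-intent relevance level (1-4)."""
    return sum(intent_prob.get(intent, 0.0) * level
               for intent, level in qrels.get(doc, {}).items())

def d_ndcg_at_k(ranked_docs, qrels: Qrels, intent_prob: Dict[str, float], k: int = 10) -> float:
    """D-nDCG@k (Eq. 1): discounted cumulative global gain, normalised by the globally ideal list."""
    dcg = sum(global_gain(doc, qrels, intent_prob) / math.log2(r + 1)
              for r, doc in enumerate(ranked_docs[:k], start=1))
    # Globally ideal list: all judged documents sorted by global gain in descending order.
    ideal_gains = sorted((global_gain(doc, qrels, intent_prob) for doc in qrels), reverse=True)
    idcg = sum(g / math.log2(r + 1) for r, g in enumerate(ideal_gains[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0
```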

In the \(D\sharp\) framework, D-measure values (representing the overall relevance of a SERP) are plotted against I-rec (representing the diversity of a SERP). In addition, a simple single-value metric that combines the two axes can be computed, e.g.:

$$ D\sharp{\hbox{-}}nDCG=\gamma I{\hbox{-}}rec + (1-\gamma) D{\hbox{-}}nDCG $$
(2)

where γ is a parameter, set to 0.5 throughout this study. \(\hbox{D}\,\sharp\)-nDCG is generally not very sensitive to the choice of γ, as D-nDCG and I-rec are already highly correlated (Sakai and Song 2011).
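Eq. (2) is then a straightforward linear combination of the two components; a one-line sketch (the function name is illustrative):

```python
def d_sharp_ndcg(i_rec: float, d_ndcg: float, gamma: float = 0.5) -> float:
    """D#-nDCG = gamma * I-rec + (1 - gamma) * D-nDCG (Eq. 2); gamma is 0.5 throughout this study."""
    return gamma * i_rec + (1.0 - gamma) * d_ndcg
```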

Next, we define a version of ERR-IA as implemented in the NTCIREVAL toolkit. (All other metrics are also computed using NTCIREVAL in this study.) Let \(Pr_{i}(r)\) denote the relevance probability of a document at rank r with respect to intent i: in this study, we let the probabilities be 1/5, 2/5, 3/5 and 4/5 for L1, L2, L3 and L4 local relevance levels, following the aforementioned gain value setting for D-nDCG. The normalised version of “local” ERR can be expressed as:

$$ nERR_{i} = \frac{\sum^{l}_{r=1} Pr_{i}(r)\prod_{k=1}^{r-1}(1- Pr_{i}(k))/r}{\sum^{l}_{r=1} Pr^{\ast}_{i}(r)\prod_{k=1}^{r-1}(1-Pr^{\ast}_{i}(k))/r} $$
(3)

where \(Pr^{\ast}_{i}(r)\) is the relevance probability of a document at rank r in an ideal ranked list for intent i. Note that this “locally ideal” list needs to be defined for each intent.

Finally, our Intent-Aware nERR is defined as:

$$ nERR{\hbox{-}}IA = \sum_{i} Pr(i|q) nERR_{i}. $$
(4)

Note that while nERR is a properly normalised metric, nERR-IA is an undernormalised metric: the maximum value reachable is usually less than one as a single ranked list is highly unlikely to be locally ideal for every intent (Sakai and Song 2011). This formulation of nERR-IA is slightly different from that described by Clarke et al. (2011a) but serves our purpose: normalisation is expected to slightly improve the discriminative power of the raw ERR-IA (Sakai and Song 2011) and it does not affect the concordance test as it compares two ranked lists per topic.
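The following sketch mirrors Eqs. (3)–(4) under the relevance-probability setting above (level/5). As with the earlier sketches, the qrels layout is an illustrative assumption; the locally ideal list is recomputed per intent, and the result is the undernormalised nERR-IA discussed above.

```python
from typing import Dict, List

def err(rel_probs: List[float], k: int = 10) -> float:
    """Cascade ERR over per-rank relevance probabilities: sum_r Pr(r) * prod_{j<r}(1 - Pr(j)) / r."""
    score, p_continue = 0.0, 1.0
    for r, p in enumerate(rel_probs[:k], start=1):
        score += p_continue * p / r
        p_continue *= (1.0 - p)
    return score

def nerr_ia_at_k(ranked_docs: List[str], qrels: Dict[str, Dict[str, int]],
                 intent_prob: Dict[str, float], k: int = 10) -> float:
    """nERR-IA@k (Eq. 4): intent-probability-weighted sum of per-intent normalised ERR (Eq. 3)."""
    total = 0.0
    for intent, prob in intent_prob.items():
        run_probs = [qrels.get(doc, {}).get(intent, 0) / 5.0 for doc in ranked_docs]
        # Locally ideal list for this intent: judged documents sorted by per-intent relevance.
        ideal_probs = sorted((levels.get(intent, 0) / 5.0 for levels in qrels.values()),
                             reverse=True)
        ideal_err = err(ideal_probs, k)
        total += prob * (err(run_probs, k) / ideal_err if ideal_err > 0 else 0.0)
    return total
```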

As we have mentioned in Sect. 2.1, Chapelle et al. (2011) showed that ERR-IA and α-nDCG are members of a family of metrics called Intent-Aware Cascade-Based Metrics (CBM-IAs). A CBM-IA discounts returned relevant documents based on relevant documents already seen for each intent: by encouraging “novelty” for each intent, it encourages diversity across the intents. On the other hand, D-nDCG is an overall relevance metric that aggregates intent probabilities and per-intent graded relevance, and does not explicitly encourage diversity. This is why I-rec, a pure diversity metric, is used to compute the summary metric \(\hbox{D}\sharp\)-nDCG. As was mentioned earlier, note that D-nDCG is plotted against I-rec at NTCIR to see whether participating systems are relevance-oriented or diversity-oriented. [This is similar to the practice at TREC where precision is plotted against I-rec (Clarke et al. 2010).] Put another way, while the relevance and diversity features are elegantly embedded in a CBM-IA, \(\hbox{D} \sharp\)-nDCG enables isolation of these properties when evaluating the participating runs.

Table 2a, b shows Kendall’s τ and symmetric τ_ap values (Yilmaz et al. 2008) for all pairs of metrics when the NTCIR-9 INTENT runs are ranked using the official data, with intent probabilities and per-intent graded relevance. τ_ap is similar to τ but is by design more sensitive to changes near the top ranks. Table 2c shows similar results for TREC 2009 with per-intent graded relevance data (but with uniform intent probabilities). It can be observed, for example, that D-nDCG and nERR-IA produce similar run rankings even though they are based on quite different principles. For example, the τ between the two metrics with the INTENT Chinese runs is .942. What we are interested in, however, is when these metrics disagree with each other, as we shall discuss below.
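For reference, here is a sketch of the two rank correlations: Kendall's τ over per-run score pairs, and τ_ap following the AP correlation of Yilmaz et al. (2008), with the symmetric variant taken here as the mean of the two directional values. The exact tie handling and the symmetrisation should be treated as assumptions of this sketch.

```python
from itertools import combinations
from typing import Dict, List

def kendall_tau(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall's tau between two rankings of the same runs, given per-run scores under two metrics."""
    runs = sorted(scores_a)
    concordant = discordant = 0
    for r1, r2 in combinations(runs, 2):
        sign = (scores_a[r1] - scores_a[r2]) * (scores_b[r1] - scores_b[r2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (len(runs) * (len(runs) - 1) / 2)

def tau_ap(reference: List[str], candidate: List[str]) -> float:
    """AP rank correlation of `candidate` against `reference` (top-weighted)."""
    ref_pos = {run: pos for pos, run in enumerate(reference)}
    n = len(candidate)
    total = 0.0
    for i in range(1, n):  # items at candidate ranks 2..n
        correct = sum(1 for run in candidate[:i] if ref_pos[run] < ref_pos[candidate[i]])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0

def symmetric_tau_ap(ranking_a: List[str], ranking_b: List[str]) -> float:
    """Symmetric tau_ap: mean of the two directional values (assumed symmetrisation)."""
    return 0.5 * (tau_ap(ranking_a, ranking_b) + tau_ap(ranking_b, ranking_a))
```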

Table 2 τ/symmetric τ_ap rank correlations between metric pairs

4 “Intuitiveness” of metrics

4.1 Concordance tests

To discuss which diversified runs are better than others, TREC primarily uses a version of ERR-IA, while NTCIR uses the \(\hbox{D}\,\sharp\) framework. Diversity metrics try to balance relevance and diversity for ranked retrieval and inevitably tend to be complex, which makes it particularly hard for researchers to discuss which metrics are “measuring what we want to measure.” To address this problem, Sakai (2012) proposed a simple method for quantifying “which metric is more intuitive than the other.”

The concordance test algorithm (Sakai 2012) is shown in Fig. 1. The algorithm computes relative concordance scores for a given pair of metrics \(M_1\) and \(M_2\) (e.g. nERR-IA and \(\hbox{D}\,\sharp\)-nDCG) and a gold-standard metric \(M_{GS}\), which should represent a basic property that we want the candidate metrics to satisfy. For the purpose of our study, we consider the following three simple set retrieval metrics as the gold standards:

  • I-rec: Intent recall, which represents the ability of a metric to reward diversity;

  • Prec: Precision, which represents the ability of a metric to reward relevance (where a document relevant to at least one intent is treated as relevant to the topic); and

  • PMP: Precision for the Most Popular intent, which represents the ability of a metric to reward the emphasis on a popular intent (where the popularity is defined based on the intent probabilities provided in the INTENT data, and only documents relevant to the most popular intent are counted as relevant to the topic).

Note that none of these gold standards is sufficient as a stand-alone diversity metric: for example, none of them takes document ranks and graded relevance into account. The purpose of the gold standards is to separate out and test the important properties of the more complex diversity metrics.
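For concreteness, a sketch of the two set-retrieval gold standards not already shown above (Prec and PMP), again over the illustrative qrels layout used earlier; PMP simply restricts the relevance test to the single most popular intent.

```python
from typing import Dict, List

def precision_at_k(ranked_docs: List[str], qrels: Dict[str, Dict[str, int]], k: int = 10) -> float:
    """Precision: fraction of the top k documents relevant to at least one intent of the topic."""
    relevant = sum(1 for doc in ranked_docs[:k]
                   if any(level > 0 for level in qrels.get(doc, {}).values()))
    return relevant / k

def pmp_at_k(ranked_docs: List[str], qrels: Dict[str, Dict[str, int]],
             intent_prob: Dict[str, float], k: int = 10) -> float:
    """Precision for the Most Popular intent: only documents relevant to that intent count."""
    most_popular = max(intent_prob, key=intent_prob.get)
    relevant = sum(1 for doc in ranked_docs[:k]
                   if qrels.get(doc, {}).get(most_popular, 0) > 0)
    return relevant / k
```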

Fig. 1 Concordance test for comparing metrics \(M_1\) and \(M_2\) based on a gold standard \(M_{GS}\). For example, \(M_1(t, r_1)\) denotes the value of metric \(M_1\) for run \(r_1\) with topic t

The algorithm shown in Fig. 1 obtains all pairs of ranked lists for which \(M_1\) and \(M_2\) disagree with each other as to which list is better. Then, out of these disagreements, it counts how often each metric agrees with the gold-standard metric. In this way, we can discuss which of the two metrics is the more “intuitive.” Moreover, as we can argue that an ideal diversity metric should be consistent with all of the above three gold standards, we will also extend the algorithm in Fig. 1 and count how often a candidate metric agrees with all three gold standards at the same time.
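The counting described above can be sketched as follows, with per-topic, per-run metric scores precomputed. This is a simplified reading of the algorithm in Fig. 1; in particular, the handling of ties with the gold standard is an assumption.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Scores = Dict[Tuple[str, str], float]  # (topic, run) -> metric value

def concordance(metric1: Scores, metric2: Scores, gold: Scores,
                topics: List[str], runs: List[str]) -> Tuple[int, int, int]:
    """Over ranked-list pairs where metric1 and metric2 disagree on which list is better,
    count how often each of them agrees with the gold-standard metric."""
    agree1 = agree2 = disagreements = 0
    for topic in topics:
        for run1, run2 in combinations(runs, 2):
            d1 = metric1[(topic, run1)] - metric1[(topic, run2)]
            d2 = metric2[(topic, run1)] - metric2[(topic, run2)]
            if d1 * d2 >= 0:          # the candidate metrics agree (or one of them ties): skip
                continue
            disagreements += 1
            dg = gold[(topic, run1)] - gold[(topic, run2)]
            if dg * d1 > 0:
                agree1 += 1           # metric1's preference matches the gold standard
            elif dg * d2 > 0:
                agree2 += 1           # metric2's preference matches the gold standard
    return agree1, agree2, disagreements
```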

Note that we are treating I-rec as one of the gold-standard metrics in the concordance tests, as no other metric better represents diversity. Because \(\hbox{D}\,\sharp\)-nDCG directly depends on I-rec (2), it is no surprise that it agrees very well with I-rec. However, we have much more informative findings, as we shall discuss below.

Table 3 shows the results of our concordance tests for the two NTCIR data sets. For example, Table 3a(I) shows that, of all the ranked list pairs from the INTENT Chinese runs (there are 100 topics times 24*23/2 = 276 run pairs, i.e. 27,600 ranked list pairs in total), D-nDCG and nERR-IA disagree with each other for 4,320 pairs; D-nDCG agrees with I-rec for 54 % of these pairs while nERR-IA agrees with I-rec for 74 %; and this difference is statistically significant at α = 0.01 according to a two-sided sign test. These results suggest that nERR-IA may be a more diversity-oriented metric than D-nDCG is (although it turns out that this does not hold for the TREC data, as discussed in Sect. 4.2). However, within the same setting where I-rec is used as the gold standard, \(\hbox{D}\,\sharp\)-nDCG far outperforms nERR-IA as the former directly depends on I-rec. From the table, we can observe that:

  (I) In terms of the ability to reward diversity (as measured by agreement with I-rec), \(\hbox{D}\,\sharp\)-nDCG significantly outperforms D-nDCG and nERR-IA, while nERR-IA significantly outperforms D-nDCG;

  (II) In terms of the ability to reward relevance (as measured by agreement with Prec), D-nDCG significantly outperforms \(\hbox{D}\,\sharp\)-nDCG and nERR-IA, while \(\hbox{D}\,\sharp\)-nDCG significantly outperforms nERR-IA;

  (III) In terms of the ability to emphasise a popular intent (as measured by agreement with PMP), D-nDCG significantly outperforms \(\hbox{D}\,\sharp\)-nDCG and nERR-IA, while \(\hbox{D}\,\sharp\)-nDCG significantly outperforms nERR-IA;

  (IV) In terms of the agreement with all three gold standards, \(\hbox{D}\,\sharp\)-nDCG significantly outperforms D-nDCG and nERR-IA, while D-nDCG significantly outperforms nERR-IA.

Note, in particular, the results in Table 3(IV): not only \(\hbox{D}\,\sharp\)-nDCG, but also D-nDCG (which does not directly depend on I-rec) significantly outperforms nERR-IA as a metric that emphasises diversity, overall relevance and the relevance to the most popular intent at the same time. These results are consistent across the two data sets, which strongly suggests that the \(\hbox{D}\,\sharp\) framework offers more “intuitive” evaluation than nERR-IA, provided that we accept the three gold standards as representative of the desirable properties in diversity metrics.

Table 3 Concordance results with the NTCIR-9 INTENT data

4.2 Concordance tests: additional TREC results

Is the above claim about the “intuitiveness” of metrics too strong? Do the results generalise to non-NTCIR data? To address these concerns, we conducted similar concordance tests with the graded-relevance version of the TREC 2009 web diversity data. Because this data set lacks the intent popularity information, we cannot use PMP as a gold standard metric. Hence we rely only on I-rec and Prec (i.e. diversity and relevance).

Table 4 shows the concordance results for TREC. It can be observed that the results are generally consistent with the NTCIR ones: \(\hbox{D}\sharp\)-nDCG is the most diversity-oriented metric (Table 4(I)); D-nDCG is the most relevance-oriented metric (Table 4(II)); and as a metric that rewards both diversity and relevance, \(\hbox{D}\,\sharp\)-nDCG is the clear winner (Table 4(III)). Note, in particular, that not only \(\hbox{D}\,\sharp\)-nDCG but also D-nDCG far outperforms nERR-IA in Table 4(III), by agreeing with both I-rec and Precision far more often than nERR-IA does. The only pairwise comparison that lacks statistical significance is that for D-nDCG and nERR-IA when I-rec is used as the gold standard: they perform comparably here, in contrast to our NTCIR results where nERR-IA significantly outperformed D-nDCG with I-rec as the gold standard.

Table 4 Concordance results with the TREC 2009 diversity data with graded relevance

It is generally recommended to evaluate systems with multiple evaluation metrics and thereby examine them from different perspectives. Thus we are not necessarily looking for a “one and only” evaluation metric. However, it would be fair to say that, based on our results, the introduction of the \(\hbox{D}\,\sharp\) framework at NTCIR was certainly worthwhile.

5 Graded relevance and intent popularity

5.1 Simplified test collections

We now examine the effects of incorporating per-intent graded relevance and intent popularity in diversity evaluation. In order to do this, we created four simplified versions of each NTCIR-9 INTENT test collection:

  • popularity + binary: The per-intent graded relevance assessments are collapsed into per-intent binary relevance assessments, by regarding relevance levels of L1 and above as just “relevant”;

  • uniform + graded: The intent popularity information is dropped and each intent given a query is assumed to be equally likely;

  • uniform + binary: A combination of the above two, which mimics the TREC diversity evaluation environment;

  • linear + graded: Instead of using the absolute intent probabilities estimated through assessor voting, we transform them into a linear distribution: the hypothesis is that only the relative intent popularity matters, and that accurate estimates of intent probabilities are not necessary. For example, if there are three intents ranked by their actual popularity, we reset their probabilities to 3/6, 2/6 and 1/6. Intent IDs are used for breaking ties (i.e. equally popular intents). A minimal sketch of this transformation is given after this list.
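As forward-referenced above, a minimal sketch of the linear + graded transformation; the tie-breaking by intent ID follows the text, while the data layout is an illustrative assumption.

```python
from typing import Dict

def linearise_intent_probabilities(intent_prob: Dict[str, float]) -> Dict[str, float]:
    """Replace estimated intent probabilities with a linear distribution based only on the
    popularity ranking, e.g. three intents -> 3/6, 2/6, 1/6; ties are broken by intent ID."""
    ranked = sorted(intent_prob, key=lambda intent: (-intent_prob[intent], intent))
    n = len(ranked)
    denominator = n * (n + 1) / 2
    return {intent: (n - position) / denominator for position, intent in enumerate(ranked)}
```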

We compare each of these four variants with the original (i.e. popularity + graded) NTCIR-9 INTENT test collection in terms of discriminative power as well as changes in the system ranking.

Given a test collection with a set of runs, discriminative power is measured by conducting a statistical significance test for every pair of runs and counting the number of significant differences for a fixed confidence level. While the original discriminative power method relied on the pairwise bootstrap test (Sakai 2006), any pairwise test inevitably results in a family-wise error rate of 1 − (1 − α)^k, where α is the probability of Type I Error for each pairwise test and k is the total number of run pairs. We therefore use a randomised version of the two-sided Tukey’s Honestly Significant Differences test, which takes the entire set of runs into account (Carterette 2012). This test is naturally more conservative than pairwise significance tests that disregard all other available runs, but the relative discriminative powers of different metrics remain similar (Sakai 2012). We create 1,000 randomly permuted topic-by-run matrices (Carterette 2012; Sakai 2012) for estimating Achieved Significance Levels (ASLs, a.k.a. p values).
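The following is a minimal sketch of the randomisation procedure as we read Carterette (2012) and Sakai (2012): permute each topic's row of the topic-by-run score matrix, record the largest gap between permuted run means, and estimate each run pair's ASL against that null distribution. Details such as the per-row permutation scheme are assumptions of this sketch.

```python
import random
from typing import Dict, List, Tuple

def randomised_tukey_hsd(scores: Dict[str, List[float]], n_trials: int = 1000,
                         seed: int = 0) -> Dict[Tuple[str, str], float]:
    """scores[run] = per-topic metric values (all runs share the same topic order).
    Returns an estimated ASL (p value) for every run pair."""
    rng = random.Random(seed)
    runs = sorted(scores)
    n_topics = len(scores[runs[0]])
    mean = {run: sum(scores[run]) / n_topics for run in runs}
    observed = {(a, b): abs(mean[a] - mean[b])
                for i, a in enumerate(runs) for b in runs[i + 1:]}
    exceed = {pair: 0 for pair in observed}
    for _ in range(n_trials):
        sums = {run: 0.0 for run in runs}
        for t in range(n_topics):
            row = [scores[run][t] for run in runs]
            rng.shuffle(row)              # permute this topic's scores across runs
            for run, value in zip(runs, row):
                sums[run] += value
        perm_means = [sums[run] / n_topics for run in runs]
        max_gap = max(perm_means) - min(perm_means)   # Tukey-style: largest pairwise difference
        for pair, obs in observed.items():
            if max_gap >= obs:
                exceed[pair] += 1
    return {pair: count / n_trials for pair, count in exceed.items()}
```

A run pair is then counted as significantly different at α = 0.05 when its estimated ASL falls below 0.05.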

Discriminative power is a measure of reliability, i.e. of how consistently a metric provides the same conclusion regardless of the topic set used in the experiment. Note that we are interested in metrics that are strictly functions of a ranked list of items (i.e. system output) and a set of judged items (i.e. right answers); we are not interested in a “metric” that knows one ranked list is from (say) Bing and the other is from Google and uses this knowledge to prefer one list over the other. Also, it should be stressed that discriminative power does not tell us whether a metric is right or wrong: that is why we have also discussed the concordance test earlier.

Figure 2 shows the ASL curves (Sakai 2006) with the original NTCIR-9 INTENT Chinese and Japanese data for \(\hbox{D}\,\sharp\)-nDCG, D-nDCG, nERR-IA and I-rec. The axes represent the ASLs and the run pairs sorted by ASLs, respectively, and curves that are closer to the origin represent more reliable metrics. Figure 3 shows similar graphs with the uniform + binary (i.e. TREC-like) setting. (I-rec is omitted here as it is not affected by test collection simplification.) It can be observed that nERR-IA underperforms \(\hbox{D}\,\sharp\)-nDCG and D-nDCG in all cases.

Fig. 2 ASL curves with the official NTCIR-9 INTENT data (i.e. popularity + graded). The x-axis represents run pairs sorted by ASLs (i.e. p values); the y-axis represents ASLs

Fig. 3 ASL curves with the uniform + binary (i.e. TREC-like) version of the NTCIR-9 INTENT data

Table 5 summarises the results of our discriminative power experiments with the official and simplified data sets, for α = 0.05. Parts (I) and (IV) correspond to Figs. 2 and 3, respectively. For example, Table 5a(I) shows that \(\hbox{D}\,\sharp\)-nDCG manages to detect 140 significant differences out of 276 comparisons (50.7 %), and that, given 100 topics, an absolute difference of around 0.07 is usually statistically significant (a conservative estimate). The results can be summarised as follows:

  1. In all settings with the two data sets, nERR-IA consistently underperforms \(\hbox{D}\,\sharp\)-nDCG and D-nDCG in terms of discriminative power. This generalises the findings by Sakai and Song (2011), who used a graded-relevance version of the TREC 2009 diversity data. (In contrast, the discriminative power of I-rec is quite unpredictable: it is low for Chinese and high for Japanese. One cause of this might be that the Chinese topics have fewer intents on average: see Table 1. It is clear that one cannot rely solely on I-rec for diversity evaluation.)

  2. By comparing the results in (I) with those in (IV), it can be observed that introducing both intent popularity and graded relevance to diversity evaluation is beneficial in terms of discriminative power, particularly for \(\hbox{D}\,\sharp\)-nDCG: we obtain 140 − 125 = 15 additional significant differences for Chinese, and 41 − 35 = 6 additional significant differences for Japanese. To a lesser degree, the effect is also observable for nERR-IA: we obtain 109 − 105 = 4 additional significant differences for Chinese and 34 − 32 = 2 additional significant differences for Japanese. Thus the richness of the NTCIR data is worthwhile at least for \(\hbox{D}\,\sharp\)-nDCG. (In contrast, the trend for D-nDCG is not clear: for Japanese, its discriminative power with the TREC-like setting is actually slightly higher than that with the original NTCIR setting. But the Chinese results are in line with the general trend.)

  3. At least for \(\hbox{D}\,\sharp\)-nDCG, both intent popularity and graded relevance appear to individually contribute to the aforementioned gain in discriminative power. By comparing (I) with (II), it can be observed that dropping graded relevance loses 140 − 129 = 11 significant differences for Chinese and 41 − 40 = 1 significant difference for Japanese; and by comparing (I) with (III), it can be observed that dropping intent popularity loses 140 − 126 = 14 significant differences for Chinese and 41 − 40 = 1 significant difference for Japanese. As for nERR-IA, dropping graded relevance loses a few significant differences (compare (I) with (II)), but the effect of intent popularity is not observed (compare (I) with (III)). For D-nDCG, dropping either intent popularity or graded relevance appears to have a slight positive effect on discriminative power, but the results are inconclusive. The positive effect of introducing graded relevance can also be observed by comparing (III) with (IV): the discriminative power goes up for all three metrics.

  4. By comparing (I) with (V), we can observe that the effect of transforming the original absolute probabilities into linear values is small. The metrics become slightly less discriminative for Chinese, but actually slightly more discriminative for Japanese. This result does not contradict our hypothesis that relative intent popularity may do just as well as absolute intent popularity in diversity evaluation, although it is difficult to draw a strong conclusion from these results alone.

Table 5 Discriminative power at α = 0.05 with the NTCIR-9 INTENT runs

Based on Finding (1), it is probably fair to conclude that \(\hbox{D}\,\sharp\)-nDCG is superior to nERR-IA not only in terms of the concordance test but also in terms of discriminative power. Furthermore, Findings (2) and (3) suggest that \(\hbox{D}\,\sharp\)-nDCG fully utilises the intent popularity and graded relevance information.

Table 6 shows the τ and symmetric τ_ap values when the system rankings using the simplified test collections are compared with the original ranking (i.e. popularity + graded). While the general picture is that all of the reduced data provide system rankings that are very similar to the original ranking, there are a few more interesting observations. First, while dropping the graded relevance information seems to affect \(\hbox{D}\,(\sharp)\)-nDCG more than it affects nERR-IA (see the popularity + binary column), dropping the intent popularity information seems to affect nERR-IA more than it affects \(\hbox{D}(\sharp)\)-nDCG (see the uniform + graded column). Second, and perhaps more importantly, the linear + graded column shows that transforming the absolute intent probabilities into relative values has very little impact on the system ranking. Along with Finding (4) mentioned above, this result seems to support our hypothesis that relative intent popularity may do just as well as absolute intent popularity in diversity evaluation, and therefore that accurate estimation of intent probabilities may not be necessary.

Table 6 τ/symmetric τ_ap rank correlations between the official NTCIR-9 INTENT run ranking (popularity + graded) and one based on a simplified test collection

5.2 Simplified test collections: additional TREC results

For completeness, this section uses the TREC 2009 diversity data to back up the results with the simplified NTCIR data, which we discussed in Sect. 5.1. Recall that the original TREC 2009 diversity test collection lacks the intent probabilities and per-intent graded relevance assessments, and therefore resembles the uniform + binary data we constructed in the last section. Moreover, the graded relevance version of the same test collection (Sakai and Song 2011) corresponds to the uniform + graded data.

Table 7 shows the discriminative power results for the above two sets of TREC data: note that it corresponds to Parts (III) and (IV) of Table 5. The TREC results are in line with Finding (1) in Sect. 5.1: nERR-IA is the least discriminative metric, although in this case the difference between nERR-IA and D-nDCG is negligible. As for the differences between uniform + graded and uniform + binary, the results are inconclusive. On the other hand, note that, while the discriminative power values in Table 7 are similar to the NTCIR results shown in Table 5, the performance differences (Δ) required to achieve a significant difference at α are much higher for the TREC case. This is mainly due to the difference in topic set size, which is the subject of the next section.

Table 7 Discriminative power at α = 0.05 with the TREC 2009 diversity runs

Table 8 shows the τ and τ_ap rank correlations between the uniform + binary ranking and the uniform + graded ranking for each metric. Similar to the popularity + binary column of Table 6 (where the simplified setting is compared with the official popularity + graded setting), dropping the graded relevance information affects D-nDCG most (τ = .933) and nERR-IA least (τ = .987).

Table 8 τ/symmetric τ_ap rank correlations between the official TREC 2009 diversity run ranking (uniform + binary) and one based on uniform + graded

6 Topic set size

6.1 Reduced topic sets

In this section, we discuss the gap in terms of the topic set size between the TREC web diversity task and the NTCIR-9 INTENT task: the former uses 50 topics every year, while the latter used 100 topics for Chinese and another 100 for Japanese. As we have seen in Table 1, the INTENT Chinese topic set contains 860 intents, while the Japanese set contains 1,016 intents. Assessing per-intent relevance for such data is no easy task. Could the NTCIR organisers have used 50 topics just like TREC and yet obtained similar evaluation results?

To address the above research question, we created reduced topic sets from the INTENT Chinese and Japanese topic sets, so that each set contains 90, 70, 50, 30 and 10 topics, respectively. To consider the worst case where the topics included in the set happen to be the least useful for discriminating systems, we first ranked all 100 topics by the variance in \(\hbox{D}\,\sharp\)-nDCG across all runs, and gradually removed topics with the highest variances (i.e. the most informative topics). A similar topic set reduction method was used by Sakai and Mitamura (2010).
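A sketch of this worst-case reduction; the per-topic score layout is an illustrative assumption.

```python
from statistics import pvariance
from typing import Dict, List

def reduce_topic_set(per_topic_scores: Dict[str, Dict[str, float]], keep: int) -> List[str]:
    """per_topic_scores[topic][run] = D#-nDCG of that run on that topic.
    Rank topics by across-run variance and drop the highest-variance (most informative) topics
    first, so that the `keep` least discriminative topics remain."""
    variance = {topic: pvariance(list(run_scores.values()))
                for topic, run_scores in per_topic_scores.items()}
    return sorted(variance, key=variance.get)[:keep]
```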

As we have already shown that \(\hbox{D}\,\sharp\)-nDCG and D-nDCG are superior to nERR-IA in terms of the concordance test and discriminative power, we will focus henceforth on the \(\hbox{D}\,\sharp\) evaluation framework.

Figures 4, 5, 6 show the effect of topic set reduction on the ASL curves for \(\hbox{D}\,\sharp\)-nDCG and its components, i.e. D-nDCG and I-rec. It can be observed that the discriminative power of each metric steadily declines as the topic set is reduced. (There is an anomaly in the graph for D-nDCG Chinese, where the “90 topics” curve actually slightly outperforms the original “100 topics” curve. Similarly, in the graph for D-nDCG Japanese, the “30 topics” curve slightly outperforms the “50 topics” curve.) Based on these graphs, Table 9 shows the discriminative power results for α = 0.05. It can be observed that removing just ten topics can result in the loss of some significant differences. (While removing the 10 least informative topics may not affect the outcome of significance testing this much, note that we have no established way of knowing in advance which topics are useful.) Moreover, if we compare the “100 topics” column (i.e. the NTCIR setting) with the “50 topics” column (i.e. the TREC setting), it can be observed that the discriminative power is roughly halved if we remove 50 topics: for Chinese, the discriminative power of \(\hbox{D}\,\sharp\)-nDCG goes down from 50.7 to 27.2 %; for Japanese, it goes down from 39.0 to 16.2 %. These results show that using 100 topics for evaluating the INTENT runs was certainly worthwhile in terms of statistical significance testing.

Fig. 4 ASL curves for \(\hbox{D}\,\sharp\)-nDCG with reduced topic sets

Fig. 5 ASL curves for D-nDCG with reduced topic sets

Fig. 6 ASL curves for I-rec with reduced topic sets. The curves for 10 topics lie above ASL = 0.1

Table 9 Effect of topic set reduction on discriminative power at α = 0.05

Statistical significance is always associated with some probabilities of errors, however. Let us therefore disregard statistical significance for a while and focus on the entire system ranking. Can we preserve the official INTENT run rankings using fewer topics?

According to our experimental results, the answer to the above question is no. Table 10 shows the τ and symmetric τ_ap values when the system rankings based on the reduced topic sets are compared with the original ranking with the full topic set. It can be observed that the system rankings cannot be preserved even with 90 topics: the results are better for the Japanese case, but recall that the Chinese results should be regarded as more representative as they involved more teams and runs (Table 1). It can also be observed that the Chinese run ranking in terms of I-rec collapses completely when only 10 topics are available (τ = .058).

Table 10 τ/symmetric τ_ap rank correlations between the official system ranking and one based on a reduced topic set

Figures 7 and 8 provide more information on the effect of topic set reduction on the system ranking for 90 topics and for 50 topics. For example, Fig. 7 (left) shows that removing 10 topics would swap the top two runs, and that the run officially ranked number 5 would go down to rank 11. This also suggests that the construction of 100 topics for each language at NTCIR was well worth the effort. Moreover, note that with 50 topics, the system rankings are very different. For example, the τ for I-rec with the Chinese rankings (100 topics vs. 50) is below 0.6; see also the wild disagreements in Figs. 7 and 8 (right). This suggests that, if the TREC diversity task had also used 100 topics in one round, the run rankings might have been substantially different from those that have been officially announced.

Fig. 7 Effect of topic set reduction on system ranking (Chinese)

Fig. 8 Effect of topic set reduction on system ranking (Japanese)

In short, using more topics pays, both from the viewpoint of significance testing and that of obtaining a reliable system ranking. This is of course in line with literature in traditional IR, e.g. Carterette and Smucker (2007), Sanderson and Zobel (2005), Webber et al. (2008), but to our knowledge, no other studies have looked at this issue for diversity evaluation.

6.2 Reduced topic sets: additional TREC results

For completeness, we finally report on a topic set reduction experiment with the TREC 2009 diversity data with graded relevance. As the test collection has only 50 topics, we constructed reduced topic sets of 30 and 10 topics using the variance-based method described in Sect. 6.1.

Figures 9, 10, 11 show the effect of topic set reduction on the ASL curves for \(\hbox{D}\,\sharp\)-nDCG and its components, which should be compared with Figs. 4, 5, 6. Because D-nDCG loses its discriminative power very rapidly with this data set, its ASL curve for 10 topics is not visible in Fig. 10. Based on the ASL curves, Table 11 shows the discriminative power results for α = 0.05, in a way similar to Table 9. Again, it can be observed that D-nDCG loses its discriminative power very rapidly here: while its discriminative power for the NTCIR data with 30 topics is 22.5 % for Chinese and 11.4 % for Japanese (Table 9), that for the TREC data with 30 topics is only 4.3 %. This suggests that the TREC runs are relatively similar to each other in terms of the ability to return relevant documents at least for that particular reduced topic set. However, the overall results are consistent with the NTCIR ones: losing 20 topics means losing a large number of significant differences.

Fig. 9 ASL curves for \(\hbox{D}\,\sharp\)-nDCG with reduced topic sets (TREC)

Fig. 10 ASL curves for D-nDCG with reduced topic sets (TREC). The curve for 10 topics lies above ASL = 0.1

Fig. 11 ASL curves for I-rec with reduced topic sets (TREC)

Table 11 Effect of topic set reduction on discriminative power at α = 0.05 (TREC 2009 graded relevance)

Finally, Table 12a shows how topic set reduction affects the TREC run ranking in terms of τ and τ_ap: note that, unlike Table 10, the baseline ranking is based on 50 topics rather than 100. Thus, for reference, Table 12b, c shows similar results with the NTCIR data, where the rankings based on 50 topics are treated as baselines. It can be observed that, as before, topic set reduction gradually but surely destroys the “original” ranking. The I-rec ranking for the Chinese INTENT runs is completely different even from the ranking based on 50 topics (τ = .051).

Table 12 τ/symmetric τ_ap rank correlations between the run ranking with 50 topics and one based on a smaller reduced topic set

7 Conclusions and future work

In contrast to traditional IR evaluation, diversity evaluation has only a few years of history. In this study, we examined some features of the NTCIR-9 INTENT Document Ranking task that differ from those of the TREC web track diversity task, namely, the use of the \(\hbox{D}\,\sharp\) evaluation framework, intent popularity, per-intent graded relevance, and 100 topics per language.

Our main experimental findings are:

  1. The \(\hbox{D}\,\sharp\) evaluation framework used at NTCIR provides more “intuitive” and statistically reliable results than nERR-IA. For measuring “intuitiveness,” we used the concordance test which examines how each diversity metric agrees with gold standards, namely, intent recall (representing the ability to reward diversity), precision (representing the ability to reward overall relevance) and precision for the most popular intent (representing the ability to emphasise a popular intent). Our results showed that not only \(\hbox{D}\,\sharp\)-nDCG but also D-nDCG (which does not depend directly on intent recall) far outperforms nERR-IA in terms of simultaneous agreements with the gold standards. As for reliability, we showed that \(\hbox{D}\,\sharp\)-nDCG and D-nDCG consistently outperform nERR-IA in terms of discriminative power, i.e. the ability to detect significant differences while the confidence level is held constant. It should be noted that our concordance and discriminative power results are consistent across the two INTENT data sets and the TREC 2009 diversity data. Thus, while nERR-IA has an elegant user model, the \(\hbox{D}\,\sharp\) approach is also clearly useful for diversity evaluation.

  2. Utilising both intent popularity and per-intent graded relevance as is done at NTCIR improves discriminative power, particularly for \(\hbox{D}\,\sharp\)-nDCG. Both intent popularity and per-intent graded relevance appear to individually contribute to the improvement, at least for \(\hbox{D}\,\sharp\)-nDCG. On the other hand, as our linear intent probability setting did not appear to have a substantial impact on concordance tests, discriminative power and the entire system ranking, it is possible that relative intent popularity information may suffice for sound diversity evaluation. Put another way, accurate estimates of intent probabilities may not be a necessity.

  3. Reducing the topic set size, even by just 10 topics, can affect not only significance testing but also the entire system ranking; when 50 topics are used (as in TREC) instead of 100 (as in NTCIR), the system ranking can be substantially different from the original ranking and the discriminative power can be halved.

These results suggest that the directions being explored at NTCIR are valuable.

As diversity evaluation is still in its infancy, it would probably be prudent at this stage to be open to various possibilities and to learn from best practices at different diversity evaluation forums. While the present study highlighted the benefits of the new features of the NTCIR INTENT task as compared to the TREC web diversity task, TREC also has some features that NTCIR currently lacks. Perhaps the most important are the ambiguous and faceted tags for the topics and the navigational and informational tags for the intents (subtopics). Clarke et al. (2009) have discussed the explicit use of the ambiguous and faceted tags for diversity evaluation, while Sakai (2012) has proposed to utilise the navigational and informational tags. Whether these new approaches will bring benefit to evaluation forums like TREC and NTCIR is yet to be verified.

Another important topic that we did not cover in this study is the effect of defining the intent sets for each topic on the evaluation outcome. For example, while the INTENT Chinese and Japanese topic sets had 20 equivalent topics, the intents identified (through the Subtopic Mining subtask) are quite different across the two languages (Song et al. 2011). Moreover, it is always difficult to define what the “right” granularity of an intent/subtopic should be. This research question could be addressed to some extent by devising multiple intent sets for the same topic set and then comparing the evaluation results. A more challenging question would be: can we evaluate diversified IR without explicitly defining intents? For example, can we design intent-free, nugget-based evaluation approaches (Pavlu et al. 2012; Sakai et al. 2011) for diversity evaluation?

Finally, although we did discuss the “intuitiveness” of evaluation metrics by means of the concordance test, we do not deny that real users are missing in this study (although the topics were mined from real user queries and at least some of the intents arise from user session and clickthrough data). Clickthrough-based and user-based verification of diversity evaluation metrics would be useful complements to our work.