1 Introduction

Information Retrieval aims to provide the best possible results to meet a user’s information need. Although keyword search and relevance ranking have proven to be powerful tools for identifying a user’s interest and producing result lists with relevant pages, these mechanisms fail in certain situations. Ambiguous query terms are one example: relevant documents cannot be reliably assessed without additional information from the user. Systems have to estimate a suitable relevance score and then rank accordingly. The most common way to produce these rankings is to follow the probability ranking principle (PRP) (Robertson 1977), which favors documents that are more likely to contain relevant information. For queries where the relevance scores of documents entail a lot of uncertainty, relevance rankings tend to leave many users unsatisfied: they abandon the query. Result diversification can significantly reduce this effect (Das et al. 2008).

Ambiguous queries are not the only reason why search engine results should reflect diversity. Queries like “Napoleon” or “immigration” are less ambiguous but rather multi-faceted. To capture the different aspects of such queries, a result set must contain diverse information and avoid semantically similar content within the top-k results. A truly diverse ranking then also offers an overview of the whole topic, including its various aspects and views.

Ideally, Web search results are not biased towards a certain interpretation or aspect. However, depending on the algorithm for assigning relevance scores, certain interpretations of a query may be disproportionately represented within the result set. For ambiguous queries such as “jaguar” or “java”, commercial search engines almost exclusively present documents about a single interpretation (“car” and “programming”, respectively). Reducing the influence of the (manipulable) relevance score by combining it with a diversity-aware component can help more users find what they are looking for.

As described by Wang and Zhu (2009), diverse rankings can be seen as the result of ranking under uncertainty, where the user’s information need cannot be ultimately defined. In the context of ambiguous queries, a system has to make a trade-off between the relevance of an isolated document and the risk of missing relevant aspects of a query. Wang and Zhu tackle this task by applying Modern Portfolio Theory (Markowitz 1952), an economic theory that describes how to minimize risk by spreading investments rather than “putting all one’s eggs in one basket”. For ranking, this means not favoring one interpretation or aspect of a query over all others but preferring a diverse ranking.

Although diversifying Web search results has recently attracted a lot of interest within the research community (Agrawal et al. 2009; Gollapudi and Sharma 2009; Rafiei et al. 2010; Wang and Zhu 2009), automatic evaluation of diversity is still an open problem. Following Clarke et al. (2008), the TREC community designed a subtopic retrieval task within the 2009 Web track (Clarke et al. 2009). The evaluation is based on subtopics of a query. These subtopics were identified from the query log of a commercial search engine, using co-clicks, related queries, and other information to capture the different information needs users associate with each query; the extracted subtopics were also partly judged manually. One drawback of this approach is the rather sparsely annotated data, which makes it difficult to use for judging the results of commercial search engines. The extraction process is also susceptible to missing aspects or subtopics of a query. The major drawback, however, is the need for manual judgement of whether a given Web page covers a subtopic sufficiently or not. These judgements are cumbersome and costly.

In this paper we present a topic-centered approach for evaluation, in contrast to the user-centered approach used in TREC. We propose an evaluation framework based on the Wikipedia encyclopedia and evaluate the diversity of Web search results for queries derived from titles of Wikipedia disambiguation pages. The coverage of the different aspects of a query present in Wikipedia is quantified using different entropy-based measures. We compare this evaluation setting with the TREC evaluation framework and a manual evaluation based on Wikipedia. We show that the obtained results are comparable, at lower cost and without requiring access to a large query log.

In addition, we present an approach to diversify search results by reranking. We estimate the relevance score of a document by its position in the original ranking and introduce a second score to reflect the additional diversity the document could add to the result list. This score is based on the variance of the underlying model for the document representation. We investigate language models and topic models (Blei et al. 2003), which have been shown to be useful document representations in the context of information retrieval tasks (Wei and Croft 2006; Zhai and Lafferty 2004). The main contributions of this paper are:

  • We present an approach to reranking based on the original rank and the variance on two different document representations: Latent Dirichlet allocation and smoothed language models.

  • We propose and validate an evaluation framework to automatically assess diversity using ambiguous Wikipedia titles.

  • We introduce entropy and Kullback-Leibler divergence as measures for diversity evaluation.

We show that the variance-based reranking outperforms the original rankings of two large commercial search engines with respect to diversity within the top-k results. Moreover, we show that latent topic models achieve competitive diversification requiring significantly less reranking. By comparing the proposed Wikipedia-based evaluation framework with the TREC subtopic retrieval evaluation we see comparable results without the need for a large-scale manual annotation effort.

2 Related work

Search result diversification has received considerable attention in the past years; for recent overviews of the main issues and current approaches see, for example, Radlinski et al. (2009).

2.1 Diversifying search results

One of the first works on result diversification introduces Maximum Marginal Relevance (MMR) as a ranking measure that balances relevance, as the similarity between query and search results, with diversity, as the dissimilarity among search results (Carbonell and Goldstein 1998). Notably, MMR has not only been successfully used for diversity-aware ranking, but also for text summarization, by selecting relevant and diverse text passages that cover the main topics or aspects of a text. Top-k diversification as pursued in this paper has a similar goal of covering the main aspects of a query by the top-ranked search results.

Other approaches, like Ziegler et al. (2005), diversify recommendation lists to accommodate a user’s full spectrum of interests and minimize redundancy among the recommended items. Reranking approaches to diversify search results have been explored before: Radlinski and Dumais (2006) rerank based on query reformulations obtained from a query log, with a focus on personalized search results, while Chen and Karger (2006) describe a Bayesian reranking approach to maximize the coverage of different meanings of a query in the top 10 results. Zhai and Lafferty (2006) use statistical models for queries and documents. They model user preferences as loss functions and the retrieval process as a risk minimization problem, and derive models for subtopic retrieval that take dependencies between search results into account.

More recent approaches to diversification all essentially balance relevance with diversity, but differ in the estimation of relevance and similarity, and in the choice of diversification objective. Agrawal et al. (2009) classify queries and results into categories of the ODP taxonomy, and diversify results by maximizing the sum of categories covered by the top-k results, weighted by the probability of each category given the query. Thereby the risk that the top-k results contain no relevant result at all for some category is minimized. Gollapudi and Sharma (2009) introduce a framework for analyzing approaches to diversification as variants of facility dispersion. On this basis they analyze and evaluate three diversification objectives: MaxSum, which takes into account all pairwise dissimilarities between top-k results as a measure of diversity; MaxMin, which maximizes just the minimum relevance and dissimilarity of results; and MonoObjective, a weighted aggregation of relevance and average dissimilarity for each top-k result. Wang and Zhu (2009) introduce an approach to search result diversification adopting the Modern Portfolio Theory of finance. They generalize the well-known probability ranking principle (PRP) (Robertson 1977) by not only maximizing the relevance of the top-k results but also minimizing the (co-)variance of the results. A greedy algorithm is used to rank search results such that relevance is maximized while variance is minimized. Rafiei et al. (2010) introduce a similar framework based on Portfolio Theory for reranking Web search results. Instead of the greedy algorithm used in Wang and Zhu (2009), they use quadratic programming optimization to arrive at optimal portfolios.

Very recently, Santamaría et al. (2010) investigated the use of Wikipedia to improve diversity in Web search results. They manually annotated the top 100 results of a Web search engine for a set of 40 nouns with Wikipedia senses extracted from disambiguation pages. They showed that Wikipedia senses cover 56% of the Web pages and that Wikipedia is thus much better suited than other sense inventories like WordNet (32%). Additionally, they propose using a vector space model and cosine similarity or word sense disambiguation algorithms to assign a Wikipedia sense to each page. Maximizing the number of different Wikipedia senses is then the goal of their greedy reranking algorithm.

There are a couple of approaches based on first topically clustering the search results and then diversifying based on the cluster information. Carterette and Chandar (2009) propose to use probabilistic models to cover different facets of a query in the top-k results. Among others they use Latent Dirichlet allocation (LDA) to cluster documents based on the extracted latent topics, which they consider to be subtopics. They do not use a variance-based approach as proposed in this paper. Another clustering approach based on LDA is presented in He et al. (2011). For each query LDA is applied and documents are assigned to latent topics, with each topic constituting a cluster. For the diverse ranking, the clusters are ranked and documents are then picked from them. A clustering approach based on k-means is described in Bi et al. (2009). The documents are clustered and the diversified ranking is produced by picking documents from each cluster according to its size.

The problem of result diversification is also investigated in the area of structured data queries. Recommending a set of items to the user or returning a list of products in response to a keyword query are applications for result diversification. Vee et al. (2008) propose an efficient algorithm to find a representative, diverse set of top-k results for a given form-based query. All attributes of an object are ordered according to their priority for diversification by a domain expert. Jain et al. (2004) make use of k-nearest neighbor clustering techniques and combine them with a notion of diversity based on a distance metric. Each query is represented as a point in an n-dimensional space, and the k nearest neighbors that also satisfy the required distance are selected. Demidova et al. (2010) go a step further by introducing an approach that diversifies keyword queries against structured databases based on their schema rather than diversifying the results. A necessary condition for these approaches is that the database schema captures the semantics of the domain at hand.

In this paper we follow the approach of Wang and Zhu (2009) to minimize (co-)variance for maximizing diversity, but rely on the search engine ranking for estimating relevance rather than estimating it from the documents directly. As the relevance estimated from the original ranking and the variance are typically on different scales, this requires normalizing the variance in order to balance the two. Moreover, we evaluate to what extent a condensed representation in terms of latent topic models can capture diversity better than the language modeling approach used in Wang and Zhu (2009).

2.2 Diversity evaluation

Evaluation of result diversification requires new measures that consider more than just simple relevance judgements. To this end, several extensions to traditional measures have been proposed. Their common idea is that queries and documents cover several subtopics (also called aspects or nuggets), and thus relevance is assessed w.r.t. subtopics rather than w.r.t. documents.

Chen and Karger (2006) evaluate their approach on different TREC tasks (robust track, interactive track, and manually annotated TREC data). In Radlinski and Dumais (2006), user assessment of the results is used to measure whether the diversified result list contains at least one document satisfying the user’s interest. Zhai et al. (2003) introduce variants of recall and precision that take into account the subtopics of a query. S-recall at K measures the proportion of subtopics covered by the top-K results, and S-precision at r measures the ratio of the optimal rank \(K_{opt}\) achievable for a given recall r to the actual rank K at which recall r is reached.

Clarke et al. (2008) introduce α–nDCG as a generalization of the nDCG measure (normalized Discounted Cumulative Gain). Whereas nDCG only measures the relevance of search results, discounted by the logarithm of their rank, α–nDCG in addition penalizes repeated subtopics in search results. For evaluating diversification of search engine results the required explicit relevance assessments in terms of subtopics are difficult to acquire. Gollapudi and Sharma (2009) avoid the need for human relevance assessment by taking Wikipedia pages returned for a query as subtopics, and estimate subtopic relevance by a thresholded similarity between result documents and Wikipedia pages to measure S-recall (also called novelty). In this paper we also compare original and diversified rankings with respect to Wikipedia, but estimate “subtopic coverage” directly on the language models of the top-k results and Wikipedia.

3 Diversification by reranking

Our goal is, on the one hand, to cover for each query as many different aspects as possible within the top-k search results. On the other hand, ranking of Web pages is predominantly done by picking the most topically relevant pages for a keyword query according to the probability ranking principle (Robertson 1977). A diverse search result cannot neglect the relevance aspect; the relevance of Web pages for a user’s query still plays an important role. A trade-off between relevance and diversity (Cooper 1994) is incorporated within our system to accommodate this mutual relation.

3.1 Overall approach

In its simplest form, the probability ranking principle assumes that the usefulness of each individual result depends only on the query, but not on the other results. Under this assumption, given a good estimate of the relevance \(E(r_i)\) for each result \(r_i\) individually, ordering the results by decreasing \(E(r_i)\) is optimal. However, especially for Web search results, this assumption clearly does not hold. In the extreme, if the most relevant result is duplicated, the top results will all be the same, with all but the first one not adding useful information. More generally, if results overlap with each other, the top results will often be dominated by one interpretation of a query. Thus, the general goal of diversification is to balance the relevance of individual results against their overlap.

One popular approach to this end is to minimize the mutual overlap between the top-k results, using some similarity measure such as Jaccard similarity or cosine similarity. We adopt a closely related approach, originally introduced in Wang and Zhu (2009), which maximizes the expected relevance \(E(R_k)\) and minimizes the variance \(Var(R_k)\) for the top k documents of a search result \(R_n = r_1, \ldots, r_n\):

$$ E(R_k) - B*Var(R_k) $$
(1)

where B regulates the trade-off between relevance and diversity. Expected relevance \(E(R_k)\) and variance \(Var(R_k)\) are calculated as weighted sums over the individual results \(r_i\):

$$ \begin{aligned} E(R_k) &= {\sum}_{i=1}^{k} w_i E(r_i) \\ Var(R_k) &= {\sum}_{i=1}^{k} {\sum}_{j=1}^{k} w_i w_j c_{i,j} \\ \end{aligned} $$

where \(c_{i,j}\) is the covariance of results \(r_i\) and \(r_j\), calculated based on their vector representations as defined in (7), and \(w_i\) is a normalized discount factor (Järvelin and Kekäläinen 2002):

$$ w_i = \frac{1}{log_2(i+1)\sum_{j=1}^{n}\frac{1}{log_2(j+1)}} $$
(2)

The factor \(\frac{1}{log_2(i+1)}\) is 1 for rank i = 1 and decreases monotonically; the second factor in the denominator normalizes the sum of all \(w_i\) to 1.
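As a small illustration (not the authors’ code), the normalized discount weights of Eq. (2) can be computed as follows; the function name and the 1-based rank convention are assumptions of this sketch:

```python
import math

def discount_weights(n):
    """Rank discount factors w_i from Eq. (2): 1/log2(i+1) for 1-based rank i,
    normalized so that the weights of all n results sum to 1."""
    raw = [1.0 / math.log2(i + 1) for i in range(1, n + 1)]
    total = sum(raw)
    return [r / total for r in raw]

# The weight of rank 1 is the largest; weights decrease monotonically with rank.
```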

Diversity is inversely proportional to variance: a small variance \(Var(R_k)\) corresponds to large diversity, because all diverse aspects of a query are covered more or less equally. In the extreme, when all aspects, as represented by their topical terms (see Sect. 3.2.1), occur equally often, the variance is 0. B controls the relative importance of diversity versus relevance. For B > 0, relevance and diversity are balanced against each other. In particular for ambiguous queries, choosing relevant and at the same time diverse and complementary documents with high \(E(R_k)\) and low \(Var(R_k)\) reduces the risk that the top-k results do not contain any relevant document at all for some of the possible query interpretations. With a very large B, the original ranking is practically overridden, and the top (few) k results will cover any topic that occurs somewhere in the complete search result. However, in our experiments giving equal weight to the original ranking and the variance typically achieves good topic coverage, which does not improve significantly with increasing B (see Sect. 5.2). On the contrary, a large B can even hurt topic coverage, because documents with low relevance very often do not cover any relevant topic at all. For B < 0, relevance and variance are maximized, and thus diversity is minimized. This favors one particular interpretation with high \(E(R_k)\) but also high \(Var(R_k)\), which increases the risk of missing out on other plausible interpretations altogether.

Finding a reranking that globally optimizes the objective in (1) is infeasible, as it would require testing all permutations of the original ranking. Thus, following common practice, we approximate the optimal reranking using a greedy algorithm that selects for each new rank k the result \(r_i\) such that the increase in the objective at rank k, \(O_k - O_{k-1}\), is maximized (Wang and Zhu 2009):

$$ \begin{aligned} O_k - O_{k-1} & = {\sum}_{i=1}^{k} w_i E(r_i) - B {\sum}_{i=1}^{k} {\sum}_{j=1}^{k} w_i w_j c_{i,j} \\ & \quad - {\sum}_{i=1}^{k-1} w_i E(r_i) + B {\sum}_{i=1}^{k-1} {\sum}_{j=1}^{k-1} w_i w_j c_{i,j} \\ & = w_k (E(r_k) - B w_k c_{k,k} - 2 B {\sum}_{i=1}^{k-1} w_i c_{i,k}) \\ & \propto E(r_k) - B w_k c_{k,k} - 2 B {\sum}_{i=1}^{k-1} w_i c_{i,k} \end{aligned} $$
(3)

The multiplier \(w_k\) is constant for all candidate documents to be selected for rank k and thus can be ignored.

In contrast to Wang and Zhu (2009) we do not estimate the expected relevance \(E(r_k)\) from the query and individual results, but rather rely on the original ranking of the search engine, which takes into account a variety of factors, including relevance, popularity, and user preferences. As search engines typically do not provide an actual score, we set \(E(r_k)\) to the discount factor \(w_i\) of a result document \(r_i\) to be reranked to position k. \(c_{k,k}\) is the (inner) variance \(\sigma^2(r_k)\) of result \(r_k\) at the new rank k, as defined in (7). This leads to the following optimization objective: at each new rank k, select the document \(r_i\) at the original rank i such that

$$ w_i - B w_k \sigma^2(r_i) - 2 B \sum_{j=1}^{k-1} w_j c_{j,i} $$
(4)

is maximized.

A couple of technical remarks are in order. To effectively balance \(E(R_k)\) and \(B*Var(R_k)\), they should be of the same order of magnitude. To this end, we calibrate B as follows:

$$ B = \frac{\beta}{avg_i \sigma^2(r_i)} $$
(5)

where \(avg_i\, \sigma^2(r_i)\) is the average (inner) variance over all results \(r_i\). With this approach, β = 1 gives approximately equal weight to relevance and diversity.

The complexity of the greedy reranking algorithm is O((n − k)*k*|V|) for reranking the top-k results, given n overall results and vocabulary size |V|. Thus, for relatively small k in the range of the typical 10 results on the first page, online reranking is feasible, in particular when combined with standard techniques such as caching popular queries.
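As an illustrative sketch (not the authors’ implementation), the greedy selection according to objective (4) together with the calibration of B from Eq. (5) could look as follows; the covariance matrix is assumed to be precomputed from the document representations of Sect. 3.2, and all names are ours:

```python
import numpy as np

def calibrate_B(cov, beta=1.0):
    """Eq. (5): scale beta by the average inner variance so that relevance and
    diversity are of comparable magnitude (beta = 1 ~ roughly equal weight)."""
    return beta / np.mean(np.diag(cov))

def greedy_diversify(w, cov, B, k):
    """Greedy reranking according to objective (4).

    w   : discount weights from Eq. (2); w[i] also serves as the relevance proxy
          E(r) of the document at original rank i+1
    cov : n x n covariance matrix of the document vectors (Eq. 7), with cov[i, i]
          being the inner variance sigma^2(r_i)
    B   : relevance/diversity trade-off, e.g. from calibrate_B
    k   : number of top positions to fill

    Returns original-ranking indices in their new, diversified order.
    """
    n = len(w)
    new_order = []                       # documents already placed at new ranks 1..pos
    remaining = set(range(n))
    for pos in range(min(k, n)):         # new rank = pos + 1
        w_new = w[pos]
        best_i, best_score = None, -np.inf
        for i in remaining:
            # Eq. (4): w_i - B*w_k*sigma^2(r_i) - 2*B*sum_{j<k} w_j*c_{j,i},
            # where w_j is the weight of the new rank j of an already placed document
            overlap = sum(w[j] * cov[new_order[j], i] for j in range(pos))
            score = w[i] - B * w_new * cov[i, i] - 2.0 * B * overlap
            if score > best_score:
                best_i, best_score = i, score
        new_order.append(best_i)
        remaining.remove(best_i)
    return new_order + sorted(remaining)  # leave the tail in original order
```

In this sketch the covariances are assumed to be precomputed; computing them on demand over the vocabulary is what yields the O((n − k)*k*|V|) complexity stated above.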

3.2 Representation of documents

In order to calculate the variances we represent individual documents \(r_i\) as vectors. We have experimented with two alternative representations: smoothed (unigram) language models and latent topic models.

3.2.1 Language models

The Jelinek-Mercer smoothed language model (Zhai and Lafferty 2004) for a document r is defined as

$$ q_i = \lambda*p(v_i|r) + (1-\lambda)*p(v_i) $$
(6)

where \(p(v_i|r)\) is the relative frequency of term \(v_i\) in r, and \(p(v_i)\) is its relative collection frequency. For smoothing we use the relatively large λ = 0.99.

Given two vectors \(U = u_1 {\ldots} u_n\) and \(Q = q_1{\ldots} q_n, \) their co-variance is defined as:

$$ \begin{aligned} Var(U,Q) & = \frac{1}{n} \sum_{i=1}^{n} (u_i - \bar u)(q_i - \bar q) \\ & = \frac{1}{n} \sum_{i=1}^{n} u_i q_i - \frac{1}{n^2} \\ \end{aligned} $$
(7)

The simplification is based on \(\bar u = \bar q = 1/n. \)
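As a minimal sketch (with assumed function names, not the authors’ code), the smoothed language model of Eq. (6) and the simplified covariance of Eq. (7) can be computed as follows, given term-count vectors over a shared vocabulary:

```python
import numpy as np

def smoothed_lm(doc_counts, collection_probs, lam=0.99):
    """Jelinek-Mercer smoothed unigram language model, Eq. (6).

    doc_counts       : term-count vector of the document over the shared vocabulary
    collection_probs : background distribution p(v_i) over the same vocabulary
    lam              : smoothing weight lambda (0.99 in our experiments)
    """
    counts = np.asarray(doc_counts, dtype=float)
    doc_probs = counts / counts.sum()
    return lam * doc_probs + (1.0 - lam) * collection_probs

def covariance(u, q):
    """(Co-)variance of two probability vectors as in Eq. (7), using the
    simplification mean(u) = mean(q) = 1/n for normalized distributions."""
    n = len(u)
    return float(np.dot(u, q)) / n - 1.0 / (n * n)
```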

It is interesting to compare this to cosine similarity used by other approaches to diversification:

$$ Cos(U,Q) = \frac{\sum_{i=1}^{n} u_i q_i}{\sqrt{\sum_{i=1}^{n} u_i^2}{\sqrt{\sum_{i=1}^n q_i^2}}} $$

As can be seen, covariance and cosine similarity differ only w.r.t. normalization, which plays a minor role when operating on vectors representing a normalized probability distribution. However, whereas minimizing the mutual cosine similarity between results only accounts for the overlap between results, minimizing the overall variance of a result list also accounts for the inner variance of individual results. Thereby, results that cover more aspects of a query will tend to be ranked higher.
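For contrast, a corresponding cosine similarity differs only in the normalization; the helper below is a sketch of ours, not part of the approach:

```python
import numpy as np

def cosine(u, q):
    """Cosine similarity: the same dot product as in Eq. (7), but normalized by
    vector lengths instead of the fixed 1/n and 1/n^2 terms."""
    return float(np.dot(u, q)) / (np.linalg.norm(u) * np.linalg.norm(q))
```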

3.2.2 Latent Dirichlet allocation

Smoothed language models may suffer from the curse of dimensionality, and thus not properly represent the topics or aspects of a result list. As a consequence, variance measured directly on the bag of words may not be a good indicator for topical coverage. For example, if two results are about the same topic, but use different vocabulary, their covariance will be underestimated.

Thus as an alternative representation, we have also experimented with Latent Dirichlet allocation (LDA) (Blei et al. 2003), which maps documents to a mixture of only a few latent topics. Variance is then estimated on the much lower dimensional representation of the latent topics \(P(z_i=j \mid d_i)\) as defined in (11) rather than on the bag of words derived from (6).

The principal idea behind LDA is based on the hypothesis that a person writing a document has certain topics in mind. To write about a topic then means to pick a word with a certain probability from the pool of words of that topic. A whole document can then be represented as a mixture of different topics. For Web pages where the author of a document can be considered one entity, these topics reflect the entity’s view of this document and her particular vocabulary.

The modeling process of LDA can be described as finding a mixture of topics for each Web page, i.e., \(P(z \mid d), \) with each topic described by terms following another probability distribution, i.e., \(P(t \mid z). \) This can be formalized as

$$ P(t_i \mid d)=\sum_{j=1}^{Z}{P(t_i \mid z_i=j)P(z_i=j \mid d)}, $$
(8)

where \(P(t_i \mid d)\) is the probability of the ith term for a given document d and \(z_i\) is the latent topic. \(P(t_i \mid z_i=j)\) is the probability of \(t_i\) within topic j. \(P(z_i=j \mid d)\) is the probability of picking a term from topic j in the document. The number of latent topics Z has to be defined in advance and allows adjusting the degree of specialization of the latent topics. LDA estimates the topic-term distribution \(P(t \mid z)\) and the document-topic distribution \(P(z \mid d)\) from an unlabeled corpus of documents using Dirichlet priors for the distributions and a fixed number of topics. Gibbs sampling (Griffiths and Steyvers 2004) is one possible approach to this end: it iterates multiple times over each term \(t_i\) in document \(d_i\), and samples a new topic j for the term according to the probability \(P(z_i = j \mid t_i, d_i, z_{-i})\) given in (9), until the LDA model parameters converge.

$$ P(z_i = j \mid t_i, d_i, z_{-i}) \propto \frac{C^{TZ}_{t{_i}j}+\beta}{\sum_{t}{C^{TZ}_{tj}}+T\beta} \frac{C^{DZ}_{d{_i}j}+\alpha}{\sum_{z}{C^{DZ}_{d_{i}z}}+Z\alpha} $$
(9)

\(C^{TZ}\) maintains a count of all topic–term assignments, \(C^{DZ}\) counts the document–topic assignments, \(z_{-i}\) represents all topic–term and document–topic assignments except the current assignment \(z_i\) for term \(t_i\), and α and β are the (symmetric) hyperparameters of the Dirichlet priors, serving as smoothing parameters for the counts. Based on the counts, the posterior probabilities in (8) can be estimated as follows:

$$ P(t_i \mid z_i=j) = \frac{C^{TZ}_{t{_i}j}+\beta}{\sum_{t}{C^{TZ}_{tj}}+T\beta} $$
(10)
$$ P(z_i=j \mid d_i) = \frac{C^{DZ}_{d{_i}j}+\alpha}{\sum_{z}{C^{DZ}_{d_{i}z}}+Z\alpha} $$
(11)

In our evaluation we experimented with different numbers of topics, and achieved the best results with 1000 topics for the entire search result, of which only a few topics were associated with each individual result.
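For illustration, a minimal collapsed Gibbs sampler along the lines of Eqs. (9)–(11) is sketched below; it is a didactic reimplementation with assumed variable names, not the implementation used in our experiments:

```python
import numpy as np

def lda_gibbs(docs, Z, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA, following Eqs. (9)-(11).

    docs : list of documents, each a list of term ids in [0, V)
    Z    : number of latent topics;  V : vocabulary size (T in the equations)
    alpha, beta : symmetric Dirichlet hyperparameters

    Returns (theta, phi) with theta[d, j] ~ P(z=j | d) as in Eq. (11) and
    phi[j, t] ~ P(t | z=j) as in Eq. (10).
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    C_tz = np.zeros((V, Z))                  # topic-term counts C^{TZ}
    C_dz = np.zeros((D, Z))                  # document-topic counts C^{DZ}
    C_z = np.zeros(Z)                        # total number of terms assigned to each topic
    assign = [rng.integers(0, Z, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):           # random initialization
        for t, z in zip(doc, assign[d]):
            C_tz[t, z] += 1; C_dz[d, z] += 1; C_z[z] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for pos, t in enumerate(doc):
                z = assign[d][pos]           # remove the current assignment (z_{-i})
                C_tz[t, z] -= 1; C_dz[d, z] -= 1; C_z[z] -= 1
                # Eq. (9), unnormalized: the document-length denominator is constant
                # over j and cancels after normalization
                p = (C_tz[t] + beta) / (C_z + V * beta) * (C_dz[d] + alpha)
                z = rng.choice(Z, p=p / p.sum())
                assign[d][pos] = z
                C_tz[t, z] += 1; C_dz[d, z] += 1; C_z[z] += 1

    phi = (C_tz.T + beta) / (C_z[:, None] + V * beta)                       # Eq. (10)
    theta = (C_dz + alpha) / (C_dz.sum(axis=1, keepdims=True) + Z * alpha)  # Eq. (11)
    return theta, phi
```

The row theta[d] then provides the low-dimensional representation \(P(z \mid d)\) on which the (co-)variance of Eq. (7) is computed instead of the bag-of-words language model.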

4 Evaluating diversity

To evaluate our approach we propose to use Wikipedia as a source of ground truth for diversity. Wikipedia has been shown to be an effective and reliable source of semantic knowledge (Gabrilovich and Markovitch 2007) and has been used before in the context of diversity evaluation (Gollapudi and Sharma 2009). We think that this kind of evaluation is superior to manually selected corpora for judging the diversity of Web search result rankings. Hand-crafted collections like the one used for the TREC subtopic retrieval task are not as complete and representative as a community-maintained encyclopedia like Wikipedia.

For the evaluation, we compare the original ranking with the diversity-oriented reranking and a baseline reranking based on a simple notion of relevance. The test queries are taken from the titles of Wikipedia disambiguation pages. The basic assumption is that Wikipedia articles cover the major alternative interpretations of ambiguous queries. This claim was recently backed by Santamaría et al. (2010), who showed that more than 50% of pages in their test set can be assigned to Wikipedia pages representing a particular sense of the query. Moreover, we also compare the various rankings with the “complete” search result returned for each query.

We conducted several experiments to evaluate our reranking algorithm and to verify our evaluation approach:

  1. Reranking based on language models of search results.

  2. Reranking based on topic models derived from Latent Dirichlet allocation.

  3. Comparing diversity of result rankings from Google and Yahoo!.

  4. Comparing our evaluation using TREC data and manual judgement.

4.1 Testdata

To evaluate diversity we are interested in queries that have a broad variety of aspects. This does not necessarily mean that the queries are ambiguous. A keyword query like “Las Vegas” might have different meanings, but even the interpretation as the name of a city has a lot of aspects and subtopics which diversity-aware search engines should cover in the top-k results.

The generation of the ground truth test data was a two-phase process. Firstly, we took the Wikipedia disambiguation pages and removed all pages containing digits in the title (e.g. the Wikipedia page “442_(disambiguation)”). Secondly, we searched a Wikipedia MySQL dump for the title of the disambiguation page in the title field of the database. All titles returning between 10 and 100 Wikipedia pages were kept and the others discarded. We sorted the titles of the disambiguation pages by the sum of the inlink degrees of the corresponding Wikipedia pages. The top 240 titles constitute our query set and the corresponding Wikipedia pages are our ground truth data. One example query (“billboard”) with its corresponding Wikipedia page titles is shown in Table 2.

To get the rankings of the commercial search engines Google and Yahoo! we crawled the result lists for each query up to rank 1000. For the Yahoo! search engine we got an average of 628 results per query; for Google we got 730. Search results from Wikipedia were discarded, in order to assess original and diversified rankings of non-Wikipedia results. All pages without textual content were also removed from the collection. In addition, we removed boilerplate text from the result Web pages using boilerpipe, an open source library for extracting fulltext from HTML pages, to obtain clean content for each page. For both search engines we thus have the original ranking for each query. For Wikipedia as well as for the search engine results we removed stopwords. For each query we also computed a reranking of the original results based on a simple relevance assessment taking the term frequency of each query term into account.

For each rank k we define \(R_k\) as the concatenation of all documents \(r_i\), 1 ≤ i ≤ k, after removal of Wikipedia results and results without textual content. The smoothed language model Q for each \(R_k\) is computed as described in (6).

The language model U for Wikipedia is in addition weighted by the logarithm of the indegree \(d_j\) of the articles, in order to promote more prominent interpretations:

$$ p(v_i|W) = \frac{\sum_{j=1}^{m} log_2(d_j)*n(v_i,w_j)}{\sum_{j=1}^{m} log_2(d_j)*|w_j|} $$
(12)

where m is the number of Wikipedia result pages for a query, \(n(v_i, w_j)\) is the frequency of word \(v_i\) in article \(w_j\), and W signifies that the language model is conditioned on Wikipedia.

Unless otherwise noted, Q refers to the language model of top-k search results, S refers to the complete search result of a particular query, and U refers to Wikipedia articles which contain the query in their title.
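The two language models can be sketched as follows (a hedged illustration with assumed variable names; the term-count matrices are assumed to share one vocabulary indexing):

```python
import numpy as np

def wikipedia_lm(wiki_counts, indegrees):
    """Indegree-weighted Wikipedia language model U, Eq. (12).

    wiki_counts : m x |V| matrix; row j holds the term counts n(v_i, w_j) of article w_j
    indegrees   : length-m vector of article link indegrees d_j (clamped to >= 2 here
                  so that log2(d_j) > 0; this guard is a sketch-level assumption)
    """
    log_d = np.log2(np.maximum(indegrees, 2.0))
    weighted = (wiki_counts * log_d[:, None]).sum(axis=0)
    return weighted / weighted.sum()

def topk_lm(doc_counts, k, collection_probs, lam=0.99):
    """Language model Q of the concatenated top-k results R_k: sum the term counts
    of the first k documents and smooth as in Eq. (6)."""
    concat = np.sum(doc_counts[:k], axis=0)
    probs = concat / concat.sum()
    return lam * probs + (1.0 - lam) * collection_probs
```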

4.2 Evaluation measures

As a measure for how well the top-k Web search results for a query approximate the corresponding Wikipedia articles, we calculate the Kullback-Leibler divergence between the smoothed unigram language models for the top-k results and for the Wikipedia articles. This measure estimates the number of additional bits needed to encode the distribution \(U = u_1 {\ldots} u_n\) using an optimal code for \(Q = q_1 {\ldots} q_n\), where n = |V| is the combined vocabulary size.

$$ \begin{aligned} D_{KL}(U||Q) &= H(U;Q) - H(U) \\ &= {\sum}_{i=1}^{n}u_i*log_2\left(\frac{u_i}{q_i}\right) \end{aligned} $$
(13)

In our setting, distribution Q is the combined language model of the top-k search results and distribution U is the language model of the Wikipedia articles. Thus \(D_{KL}(U||Q)\) can be directly used to measure the similarity with the combined Wikipedia articles and assess the coverage of the top-k Web pages with respect to Wikipedia.

To assess the effect of diversification on the search results Q, we also measure the entropy H(Q) for the different rankings. The higher the entropy of the top-k results, the more diverse is the set of top-k Web pages.

$$ H(Q) = - \sum_{i=1}^{|V|} q_i*log_2(q_i) $$
(14)
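Both measures are straightforward to compute from the language-model vectors; the following is a small sketch (the function names are ours), including the symmetrized variant used later in Sect. 5.3:

```python
import numpy as np

def kl_divergence(u, q):
    """D_KL(U || Q), Eq. (13). Terms with u_i = 0 contribute nothing; q is assumed
    to be smoothed so that q_i > 0 for every term."""
    mask = u > 0
    return float(np.sum(u[mask] * np.log2(u[mask] / q[mask])))

def entropy(q):
    """H(Q), Eq. (14)."""
    mask = q > 0
    return float(-np.sum(q[mask] * np.log2(q[mask])))

def symmetric_kl(u, q):
    """Symmetrized Kullback-Leibler divergence as in Eq. (16)."""
    return 0.5 * (kl_divergence(u, q) + kl_divergence(q, u))
```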

Spearman’s rank correlation coefficient ρ is used to quantify the degree of reranking between two rankings x and y.

$$ \rho(x,y)=1-\frac{6\sum_{i=1}^{n}{(x_i-y_i)^2}}{n(n^2-1)} $$
(15)

where \(x_i\) and \(y_i\) are the ranks of result i in the two rankings, and n is the number of results. A value of 1.0 means perfect correlation, 0.0 no correlation, and −1.0 perfect negative correlation. In our setting, we are interested in the degree of reranking performed by the different algorithms with respect to the original ranking.
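Eq. (15) translates directly into a few lines (assuming no tied ranks; the helper name is ours):

```python
def spearman_rho(x, y):
    """Spearman's rank correlation, Eq. (15), for two rank vectors of equal length n."""
    n = len(x)
    d_squared = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```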

4.3 An example for diversification

To exemplify the effect of diversification, we randomly picked the query “Caesar” from our evaluation set. Table 1 gives the top 10 results for this query by the original ranking and by rankings diversified on the basis of latent topic models and language models. The different colors reflect a broad categorization of the pages. While the original ranking covers some aspects of this query, including the historical persons “Julius Caesar” and “Caesar Augustus”, hotels named “Caesar”, and other companies using the iconic label “Caesar”, both diversified rerankings arguably cover also other aspects, including movies and dramas about “Caesar”, pointers to Julius Caesar’s literary work, and also a broader variety of companies labeled “Caesar”, with the notable exception of hotels. For other queries we can observe a similar effect. Generally, the diversified reranking achieves a better topic coverage in the top 10 results compared with the original ranking.

Table 1 Top 10 search results for query “Caesar” using Google search engine
Table 2 Wikipedia pages containing the query “Billboard” and its corresponding link indegrees

5 Results

We thoroughly analyzed the results and were particularly interested in three aspects: (1) comparing diversification using language models and topic models (Sect. 5.1), (2) balancing relevance and diversity (Sect. 5.2), and (3) comparing the diversity of Google and Yahoo! (Sect. 5.3).

5.1 Diversification by LM versus LDA

The goal of our first evaluation is to assess the effect of diversification for the two proposed models. Figures 1a and b show the Kullback-Leibler divergences \(D_{KL}(U||Q)\) and \(D_{KL}(Q||U)\) between the aggregated Wikipedia language models U and various rankings Q for ranks k = 1..51, averaged over all 240 queries in our test set. The original ranking from Google is labeled orig. For the “optimal” ranking opti, we greedily reranked the search results such that \(D_{KL}(U||Q)\) is minimal at each rank k. For reranking we used β = 1, balancing relevance and diversity evenly.
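For reference, the greedy construction of the opti baseline can be sketched as follows (a hedged reimplementation with assumed names, combining Eqs. (6) and (13)):

```python
import numpy as np

def optimal_ranking(doc_counts, wiki_lm, collection_probs, k, lam=0.99):
    """Greedy 'opti' baseline: at each rank pick the document that minimizes
    D_KL(U || Q_k), where Q_k is the smoothed language model (Eq. 6) of the
    concatenation of the documents chosen so far."""
    n = len(doc_counts)
    order, remaining = [], set(range(n))
    concat = np.zeros_like(doc_counts[0], dtype=float)
    for _ in range(min(k, n)):
        best_i, best_div = None, np.inf
        for i in remaining:
            counts = concat + doc_counts[i]
            q = lam * counts / counts.sum() + (1.0 - lam) * collection_probs
            mask = wiki_lm > 0
            div = float(np.sum(wiki_lm[mask] * np.log2(wiki_lm[mask] / q[mask])))
            if div < best_div:
                best_i, best_div = i, div
        order.append(best_i)
        remaining.remove(best_i)
        concat += doc_counts[best_i]
    return order
```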

Fig. 1 Kullback-Leibler divergence between the top-k search results (Q) and Wikipedia (U)

As is to be expected, the ranking rel based solely on relevance has the largest divergence \(D_{KL}\) to Wikipedia in both directions. Focussing only on relevance while disregarding possible redundancies between individual results leads to a bad topical coverage in the first few results. The original ranking orig has the second largest divergence, and the optimal ranking opti has the smallest divergence. The diversified reranking using language models lm slightly outperforms latent topic models lda at all ranks. However, this comes at the cost of a significantly larger amount of reranking: the average Spearman’s rank correlation coefficient ρ between lda and orig is 0.23, which is more than twice the ρ = 0.09 between lm and orig. Interestingly, also the “optimal” reranking opti has a significantly higher ρ = 0.17.

Figure 2a shows how quickly the various rankings Q approximate the language model of the overall search result S for each query. The smaller the divergence for the top-k results, the better they represent the overall result. Again, the relevance-based ranking rel shows the highest divergence overall, followed by the original ranking orig. But the optimal ranking opti is surpassed by lm at rank 12, and by lda at about rank 25. Thus, optimizing w.r.t. the Wikipedia content of a query generally also achieves a better representation of the search result in the first few ranks, but the generic diversification by minimizing variance performs slightly better for higher ranks (the plot for \(D_{KL}(Q||S)\) is very similar).

Fig. 2 Kullback-Leibler divergence between the top-k search results and the complete (600 pages per query) search results (left) and entropy of the top-k search results (right)

The Kullback-Leibler divergence only measures the additional bits needed for representing the query result distribution Q given an optimal code for the Wikipedia distribution U, i.e., it explicitly disregards the entropy H(Q). Figure 2b shows the entropy for the various rerankings. As is to be expected, reranking by minimizing the variance leads to a higher entropy H(Q) at all ranks. Also the optimal ranking opti leads to a higher entropy, but levels out at a slightly lower entropy than the diversified rerankings. Naturally, the increased entropy H(Q) also leads to an increased cross-entropy H(Q;U) (not shown). One consequence of this is that the improvement in divergence by diversification is less pronounced for \(D_{KL}(Q||U)\) (see Fig. 1b) than for \(D_{KL}(U||Q)\). After about rank 15, the gain of diversification is balanced by the cost of diversification in terms of entropy. The entropy of the ranking rel based on relevance is by far the smallest at all ranks. Even at rank 50 it just reaches the entropy of the diversified rerankings at rank 1. This again illustrates that ranking based on relevance covers only a few aspects of the search result.

The effects of diversification for Yahoo! search results are similar; see Sect. 5.3 for a comparison of Yahoo! and Google.

5.2 Balancing relevance and diversity

In this section we analyse the effect of the parameter β, which balances between relevance and diversity. To this end, we selected the 10 queries where the difference between the divergence of the top 10 results and that of the complete result is maximal, and varied β between 0.1 and 5. Figure 3 (left) compares the divergence using language models as document representations and shows how the KL divergences of the rerankings with different β lie between the original ranking and the optimal ranking. Increasing β beyond 5.0 does not further improve the results.

Fig. 3 Kullback-Leibler divergence \(D_{KL}(U||Q)\) for different β using language model reranking (left) and Spearman’s rank correlation coefficient for different β and for the optimal ranking

The right table in Fig. 3 compares the rank correlation for both search engines and for the optimal ranking. The general behaviour is consistent: the divergence decreases at all ranks with increasing β, at the cost of a higher degree of reranking and thus a lower rank correlation ρ. β > 1 achieves only a relatively small improvement; β > 5 (not shown) achieves no further visible improvement. As already observed in Sect. 5.1, diversification based on latent topic models lda generally achieves a ranking closer to the original ranking than diversification based on language models lm.

5.3 Comparing two search engines

Search engines certainly also make an effort towards covering the most important aspects of queries as one of their optimization objectives. Our evaluation framework can also be used to compare the topic coverage in the top-k results of different search engines. Figure 4 shows the difference \(D_{KL}(Google) - D_{KL}(Yahoo!)\) of the symmetric Kullback-Leibler divergence between the two evaluated search engines:

$$ D_{KL}(U,Q) = \frac{D_{KL}(U||Q)+D_{KL}(Q||U)}{2} $$
(16)

One graph shows the divergence with respect to Wikipedia and the other one with respect to the complete search result. Apparently Google search results tend to be significantly less diverse than Yahoo!’s search results; in the top ranking positions the divergence of Google is almost 1 bit higher than the divergence of Yahoo!.

Fig. 4 Comparison of the diversity of Google and Yahoo! using the symmetric \(D_{KL}(Google) - D_{KL}(Yahoo!)\)

Of course such a comparison should not be taken as evidence of any inherent bias of a search engine. Firstly, the observable difference may in part be due to different strategies of including Wikipedia pages, which were discarded for evaluation. In particular, if a search engine tends to rank Wikipedia pages on top and diversifies the next few results w.r.t. the top results, discarding Wikipedia pages from the evaluation will lead to underestimating the topical coverage of the remaining results. Secondly, the two search engines employ slightly different strategies in grouping related search results, which were not taken into account in our evaluation, where we mapped search results to a flat ranked list. Finally, Wikipedia does not necessarily cover all possible interpretations of a query.

6 Evaluating Wikipedia-based evaluation

To verify the viability of our proposed diversity evaluation based on Kullback-Leibler divergence and Wikipedia, we compare the results with two other diversity evaluation frameworks: subtopic retrieval from TREC and a manual evaluation using hand-annotated search results from Santamaría et al. (2010).

6.1 Comparison with TREC evaluation

In order to assess our proposed evaluation criterion based on Wikipedia coverage, we have applied our diversification approach to TREC data. In the Web Track 2009, TREC introduced a dataset to evaluate the subtopic coverage of rankings (Clarke et al. 2009). They provided a Web crawl, 50 queries, and automatically extracted subtopics for these queries. This extraction was done using the query log of a commercial search engine, co-click data, and other information. A set of Web pages from the crawl was then annotated manually with the relevant subtopic, or with “not relevant” in case a page did not cover any subtopic.

To compare this evaluation framework with our Wikipedia-based approach we identified a subset of the data satisfying our requirements:

  1. A query using Wikipedia’s search mechanism must return at least 100 Wikipedia pages.

  2. An annotated document must occur in the top 1000 results of Google.

Among the 50 TREC queries, 7 did not yield any Wikipedia page, 8 yielded fewer than 10, and 8 fewer than 100 result pages when searching Wikipedia. This leaves us with 27 queries with up to 500 ranked, relevant Wikipedia pages. The average overlap of the Web search results from our crawl with annotated TREC pages, matched by URL, is 26.6 pages per query for Google. This leaves us with only a few pages annotated as relevant for a specific subtopic, and many queries with no annotated pages for a certain subtopic.

Figure 5 (left) shows how the original ranking and the introduced reranking approaches approximate the corresponding Wikipedia content for the 27 TREC queries (cf. Fig. 1a). Again, by construction, the optimal ranking covers the Wikipedia content best in the top-k results. However, for the TREC queries, diversification based on the LDA topics slightly outperforms diversification based on the language models. The original ranking shows the largest divergence from Wikipedia.

Fig. 5 Kullback-Leibler divergence \(D_{KL}(U||Q)\) (left) and α–nDCG values (right) for the TREC evaluation

Using the manually assessed subtopics and the evaluation scripts provided by the TREC organizers, we computed α–nDCG (Clarke et al. 2008), shown in Fig. 5 (right). On this small dataset the relevance-based ranking clearly performs worst, while the original ranking and the rerankings by means of minimizing variance perform rather similarly. The slight differences are not statistically significant. Only the “optimal” reranking achieves a significant improvement for all metrics but α–nDCG@5, according to a 2-tailed paired t-test with confidence well above 95% (marked with asterisks in the table). This indicates that diversification based on a more or less representative goal model, such as Wikipedia, can outperform diversification based on analyzing only the search results. Investigating and evaluating such a goal-driven approach to diversification in more detail is an interesting subject for future work.

In summary, for the subset of the TREC queries where we had enough data, diversification based on latent topic models generally achieves better coverage of Wikipedia than diversification based on language models. Probably due to the rather small overlap between manual TREC assessments and the search engine results, the original rankings and rerankings achieved similar performance with respect to α-nDCG.

To put this into perspective, we note that the clearly best run in the diversity task of TREC 2009 (Clarke et al. 2009) also just took the original ranking provided by a major commercial search engine. Thus the achieved improvement over the original ranking is fairly remarkable.

6.2 Comparison with manual evaluation

As a second dataset to validate our evaluation method we used a test corpus compiled by Santamaría et al. (2010). This corpus comprises Web search results for 40 ambiguous queries, consisting of 15 ambiguous nouns from the Senseval-3 dataset and 25 additional ambiguous nouns where one of the senses is a band name. For all senses there exists a corresponding Wikipedia article. For each query the top 150 documents have been manually annotated with one or more senses. Documents with little text, disambiguation pages, and documents not corresponding to any Wikipedia sense have been discarded.

On the basis of the manual annotations, we have again computed α–nDCG. Figure 6 (right) compares the averaged α–nDCG for the 40 queries with our proposed evaluation criterion of Wikipedia coverage, measured by the Kullback-Leibler divergence between the search result and the language model of the Wikipedia articles (Fig. 6 (left)). As can be seen, the relative performance of the various rerankings is the same for both evaluation measures, in particular at smaller ranks. The original ranking orig is outperformed by the diversified rankings based on language models lm and topic models lda, which in turn are outperformed by the optimal ranking opti based on Wikipedia. This indicates that our proposed evaluation criterion for diversification, which does not require manual annotation, corresponds well with the widely used measure α–nDCG based on manual annotations. Moreover, the fact that the optimal reranking achieves the best α–nDCG confirms the observation of Santamaría et al. (2010) that Wikipedia can be effectively used as a target model for diversification, provided that it covers the most prominent aspects of a query.

Fig. 6 Comparison of \(D_{KL}(U||Q)\) values (left) and α–nDCG values (right) for the manual evaluation

7 Conclusions and future work

We have presented a reranking approach for balancing the top-k results of Web search engines with respect to diversity by minimizing the variance of their underlying language models and topic models. Our extensive evaluation against Wikipedia has demonstrated that the approach effectively achieves a better coverage of the various topics and aspects pertaining to a query. Our evaluation using the TREC data and supplied evaluation framework confirms these findings and validates the presented Wikipedia-based diversity evaluation as an alternative to costly manual diversity assessment.

We further demonstrated that diversification based on language models achieves a slightly better coverage in terms of Wikipedia language models than diversification based on topic models, but topic models accomplish diversification with significantly less reranking.

We are currently developing an online system to rerank on-the-fly based on Latent Dirichlet allocation. We want to apply result diversification in the context of summarization of search results as well as of events in blogs and newspaper articles. Moreover, we want to experiment with using cross-entropy and Kullback-Leibler divergence directly for reranking search results such that the top-k results provide a representative overview on the complete result. Finally, we also want to develop approaches to diversification and evaluation, which better focus on the topical content of documents.