1 Introduction

Many users have been using blogs (or weblogs) to express their thoughts or opinions on a wide range of topics, including political issues, product reviews and diary-like private posts. As the number of blog users has increased, the importance of blogs as an information source has risen, and so has the need for elaborate, customized search systems that find useful information in the blogosphere. Several commercial search engines, such as Google and Technorati, have started to provide blog search services.

Nowadays, it has become common for blog users to search for blog feeds (e.g. RSS, ATOM) relevant to topics that interest them, and then subscribe to the feeds using a feed reader such as an RSS reader. In this scenario, a key issue is how to identify blog feeds that are relevant and dedicated to a given topic. This task is blog feed search, which is one of the most important blog search services. The Blog Distillation task of TREC Blog Track (Macdonald et al. 2008; Ounis et al. 2009; Macdonald et al. 2010) also reflects the increasing interest in blog feed search.

A straightforward approach for blog feed search would be to apply existing retrieval models developed in ad-hoc retrieval. For example, we can view a blog feed as a virtual document by concatenating all constituent posts belonging to the blog feed, and then readily apply existing retrieval models without any modification. In fact, most previous work on blog feed search used this approach as the baseline system (Macdonald et al. 2008; Ounis et al. 2009; Macdonald et al. 2010).

However, blog feed search has some characteristics that limit the performance of this straightforward approach. First, the retrieval unit is a blog feed, which is an aggregation of its constituent posts, not a single blog post. In this regard, blog feed search should consider how to model the relationship between the relevance of blog posts and that of their blog feed, in response to a given topic. Second, most blog feeds contain topically diverse blog posts, depending on a blogger's interests. In other words, a blog feed generally addresses a large number of topics. This topical diversity makes it difficult for blog feed search systems to determine which blog feeds are relevant to users' information needs. Third, blog feed search has to deal with noisier data than traditional search tasks. The blog corpus is not as topically coherent as a news corpus, and may also contain non-topical content such as spam blogs and blog comment spam, which advertise commercial products and services (Kolari et al. 2006). Therefore, feed search techniques should be robust to this noisy environment.

Among the above characteristics, this paper focuses on the performance deterioration caused by the topical diversity of blog feeds. To mitigate this problem, our approaches are motivated by the passage retrieval technique, which is one of the most effective techniques to deal with topical diversity at document level for ad-hoc retrieval. We introduce global evidence and local evidence for evaluating the relevance of a blog feed in response to a query. These two types of evidence correspond to document-level and passage-level evidence for passage retrieval, respectively. Whereas global evidence is derived from all the constituent posts within a feed, local evidence is defined using a few blog posts that are highly relevant to a query.

Unlike most previous studies, which use only global evidence to estimate the relevance of a blog feed, we explicitly define and exploit local evidence in both initial retrieval and pseudo-relevance feedback (PRF). For initial retrieval, we propose an approach that integrates global evidence with local evidence, and verify that using local evidence is effective in mitigating the topical diversity problem of blog feed search. Furthermore, we present a novel document selection approach for PRF based on the local evidence of a blog feed. While several studies have examined initial retrieval models for blog feed search, PRF has not been well studied for this task, despite its importance. Our approaches select feedback documents based on the local evidence of the top-ranked blog feeds in order to improve the "precision" and "aspect recall" (Kurland et al. 2005) of the feedback information, two important factors affecting the performance of the feedback model. Experimental results show that the proposed method achieves MAP scores that are 6, 2 and 11% better than the best results of TREC 07, 08 and 09, respectively. These results are notable in that our work is the first successful feedback approach for blog feed search in a "closed setting" using only the test collection. In general, it is of common interest to investigate whether PRF in such a closed setting improves performance over the baseline for various retrieval tasks, including ad-hoc retrieval and web search (Rocchio 1971; Yu et al. 2003; Zhai and Lafferty 2001; Na et al. 2008a; Lavrenko and Croft 2001). Furthermore, to the best of our knowledge, the existing work that reports PRF improvements for blog feed search relies on an external resource, whereas we improve blog feed search performance without resorting to any other resources.

The rest of the paper is organized as follows. In Sect. 2, we present the issue of topical diversity that motivates our work, and address feed search models using the global and local evidence of a blog feed. In Sect. 3, we conduct several experiments to evaluate the performance of our proposed methods, and discuss the difference between our approach and previous work. In Sects. 4 and 5, we describe our approaches for PRF, and compare the results with traditional feedback approaches. In Sect. 6, we briefly survey related work on blog feed search. Finally, we conclude the paper and discuss future work in Sect. 7.

2 Initial retrieval model for blog feed search

2.1 Motivation: topical diversity of blog feeds

Topical diversity is a problem not only for blog feed search, but also for ad-hoc retrieval at the document level. A document can contain diverse topics, particularly when it is long. As a result, long documents are likely to be over-penalized by a retrieval algorithm even though they are relevant to a given topic, resulting in poor retrieval performance (Salton et al. 1993).

Many approaches have been proposed to solve this problem. One of the effective approaches is passage retrieval, in which the relevance score of a document is boosted by an additional score estimated using passage-level evidence. Passage retrieval has turned out to significantly improve the baseline using only traditional document-level evidence (Callan 1994; Kaszkiel and Zobel 1997; Kaszkiel and Zobel 2001; Salton et al. 1993; Na et al. 2008b; Bendersky and Kurland 2010).

Passage-level evidence has also been applied to PRF (Allan 1995; Na et al. 2008a), namely passage-based feedback which uses passages as the context for query expansion instead of documents. Passage-based feedback has been reported to result in significant improvements over conventional document-based feedback.

Topical diversity has a greater negative impact on blog feed search than on ad-hoc retrieval, because a blog feed, which is the retrieval unit of blog feed search, consists of many blog posts. A blog feed usually contains more topics than a single document, and the topics of a feed are likely to be less coherent than those of a document. This means that even if a blog feed is relevant to a given topic, a large number of posts within the feed can be irrelevant.

In practice, most relevance judgments currently used for blog feed search regard a blog feed as relevant even if only some of the posts within the feed are relevant. For example, Seo and Croft (2008) introduced several criteria for relevance judgments, where relevant feeds are divided into three levels according to the proportion of relevant posts in a feed. Their minimum cutoff criterion to determine if a blog feed is relevant is whether at least 25% of all the posts within the feed are relevant.

2.2 Retrieval framework

To deal with the topical diversity of a blog feed, this paper proposes a novel approach based on passage retrieval. We first define the global and local evidence of a blog feed. To achieve this, we make correspondences between a document and a blog feed, and a passage and a subset of blog posts within the feed. Then, we evaluate the global evidence using all the constituent posts within the feed, corresponding to the document-level evidence in passage retrieval. We also estimate the local evidence using a subset of the blog posts, corresponding to passage-level evidence.

In the following sections, we address how the evidence affects the relevance of a blog feed in response to a given query.

2.2.1 Global evidence and local evidence

Global evidence can be estimated using the overall information of a blog feed (i.e. all constituent posts). This addresses one of the important issues for evaluating the relevance of a blog feed to a query. Global evidence reflects how much the feed is devoted to a given query. Given the query, we evaluate the devotedness of a blog feed using the proportion of relevant blog posts within the feed. We assume that the more devoted a blog feed is to a given query, the more likely it is to be relevant.

Local evidence can be evaluated using a subset of blog posts within a blog feed. As the definition of a passage is important for passage retrieval, the way local evidence is defined is also a critical issue. In this paper, we utilize a set of the T most relevant posts within a blog feed to evaluate the local evidence of the feed in response to a given query. We assume that the top-ranked posts can be a representative sample of the feed about a query topic, conceptually corresponding to a passage in passage retrieval.

2.2.2 Combination of evidence

Global and local evidence have their own limitations in terms of blog feed search. First, global evidence tends to prefer a small blog feed (i.e. a feed with a small number of posts) to a large one, because small feeds are less likely to contain diverse topics. Second, local evidence uses only a few relevant posts within a blog feed, and thus cannot identify which blog feed is more devoted to a given query.

To overcome the limitations of each type of evidence, our retrieval model combines both global and local evidence. Let R(Q, F) be the relevance score of a blog feed F in response to a query Q. We use linear interpolation to combine the two types of evidence as follows:

$$ R(Q,F) = (1-\alpha)R_{G}(Q,F) + \alpha R_{L}(Q,F) $$
(1)

where R_G(Q, F) and R_L(Q, F) are the relevance scores estimated using the global and local evidence, respectively, and α is a weight parameter controlling the relative importance of the two types of evidence.
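For concreteness, Eq. (1) can be sketched as below; the function and argument names are illustrative and not part of the original formulation, and the default α is merely one of the values explored later.

```python
# A minimal sketch of Eq. (1); names and the default alpha are illustrative.
def combine_evidence(r_global: float, r_local: float, alpha: float = 0.7) -> float:
    """Linearly interpolate the global and local relevance scores of a feed."""
    return (1.0 - alpha) * r_global + alpha * r_local
```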

2.3 Basic retrieval models

Since a blog feed consists of a number of blog posts, there may be several approaches for representing the blog feed according to the granularity level. From previous work on blog feed search (Elsas et al. 2008; Macdonald and Ounis 2008; Seo and Croft 2008; Macdonald et al. 2008; Ounis et al. 2009), we can observe that there are two ways to represent a blog feed. In this study, we use the models proposed by Elsas et al. (2008) in order to represent a blog feed: “Large Document Model” (LDM) and “Small Document Model” (SDM).

Let L be a subset of a blog feed F, which will be defined differently according to the type of evidence (i.e. global or local).

First, LDM regards a blog feed as a single large document represented by concatenating all the constituent posts within it. Then, the relevance score of the feed is estimated using the relevance score between the virtual document and a query. Therefore, most ad-hoc retrieval techniques can be applied to LDM.

$$ R^{LDM}(Q,F) = Score(Q,VD) $$
(2)

where VD is a virtual document represented by concatenating the blog posts within the set L, and Score(Q, VD) can be evaluated using (4), defined later.

LDM has some problems which arise from representing a blog feed by concatenating all posts within it without any consideration of the relationship among the posts (Seo and Croft 2008).

Second, SDM regards a blog feed as a collection of all the blog posts within it. Then, its relevance score is evaluated by summing up the relevance score of each post in response to a query. The score function for SDM is defined as follows:

$$ R^{SDM}(Q,F) = \sum_{D \in L} Score(Q,D)P(D|L) $$
(3)

where P(D|L) means the probability of selecting a blog post D given the set L, and Score(Q, D) can be evaluated using (4).

There are many possible approaches to estimate the probability P(D|L) (Elsas et al. 2008). However, we assume that P(D|L) is a uniform distribution, because our interest is in exploring the influence of global and local evidence on the performance of blog feed search.
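As an illustration of how (2) and (3) operate over a given subset L, consider the following sketch; the helper names are assumptions made for exposition, with `score` standing for the query-document function of (4) and posts represented as lists of terms.

```python
# Illustrative sketch of Eqs. (2) and (3); `score` stands for Eq. (4).
def r_ldm(query, posts_in_L, score):
    """Large Document Model: score the concatenation of the posts in L."""
    virtual_doc = [term for post in posts_in_L for term in post]
    return score(query, virtual_doc)

def r_sdm(query, posts_in_L, score):
    """Small Document Model: sum post scores with uniform P(D|L)."""
    p_uniform = 1.0 / len(posts_in_L)
    return sum(score(query, post) * p_uniform for post in posts_in_L)
```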

The remaining issue is how to define the subset L of a blog feed F. We construct L according to each type of evidence, as follows:

  1. Global Evidence: L consists of all blog posts within a feed F, i.e. L = F.

  2. Local Evidence: L is the set of the top T ranked blog posts within a feed F in response to a given query, denoted by Top(T, F). Both constructions are sketched below.
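The two constructions of L can be sketched as follows; `feed` is assumed to be a list of posts, and the default T = 2 is only the value that later turns out to work well in our experiments.

```python
# Sketch of the two definitions of L; `score` again denotes Eq. (4).
def global_subset(feed):
    """Global evidence: L = F, i.e. all constituent posts of the feed."""
    return list(feed)

def local_subset(query, feed, score, T=2):
    """Local evidence: L = Top(T, F), the T posts most relevant to the query."""
    return sorted(feed, key=lambda post: score(query, post), reverse=True)[:T]
```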

Then, we can define two different models for each type of evidence, depending on which representation (LDM or SDM) is used. First, global evidence has the following two models:

  • Global Large Document Model (GLD), R^{LDM}_G, uses global evidence with the LDM for feed representation (i.e. L = F in (2)). GLD was used as the baseline by many systems in the Blog Distillation task (Macdonald et al. 2008; Ounis et al. 2009; Macdonald et al. 2010), and the results show that this model is effective without resorting to any other techniques or resources.

  • Global Small Document Model (GSD), R^{SDM}_G, uses global evidence with the SDM for feed representation (i.e. L = F in (3)).

Similarly, local evidence has the following two models:

  • Local Large Document Model (LLD), R^{LDM}_L, uses local evidence with the LDM for feed representation (i.e. L = Top(T, F) in (2)). Unlike GLD, the virtual document is represented by a concatenation of the blog posts relevant to a query, not all of the posts within the blog feed F.

  • Local Small Document Model (LSD), R^{SDM}_L, uses local evidence with the SDM for feed representation (i.e. L = Top(T, F) in (3)).

2.4 Combined models

Four models are possible in (1) using two global models (GLD and GSD) and two local models (LLD and LSD) as follows:

  • GLD+LLD: GLD for R_G(Q, F) and LLD for R_L(Q, F), formulated as R(Q, F) = (1 − α)R^{LDM}_G(Q, F) + αR^{LDM}_L(Q, F).

  • GLD+LSD: GLD for R_G(Q, F) and LSD for R_L(Q, F), formulated as R(Q, F) = (1 − α)R^{LDM}_G(Q, F) + αR^{SDM}_L(Q, F).

  • GSD+LLD: GSD for R_G(Q, F) and LLD for R_L(Q, F), formulated as R(Q, F) = (1 − α)R^{SDM}_G(Q, F) + αR^{LDM}_L(Q, F).

  • GSD+LSD: GSD for R_G(Q, F) and LSD for R_L(Q, F), formulated as R(Q, F) = (1 − α)R^{SDM}_G(Q, F) + αR^{SDM}_L(Q, F). A sketch that assembles one of these combinations is given below.
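Reusing the sketches above, one combined model, e.g. GSD+LSD, can be assembled as follows; this is only an illustration of how the pieces fit together, not the authors' implementation.

```python
# Illustrative composition of GSD+LSD from the earlier sketches.
def gsd_lsd(query, feed, score, T=2, alpha=0.7):
    r_g = r_sdm(query, global_subset(feed), score)                   # global evidence, SDM
    r_l = r_sdm(query, local_subset(query, feed, score, T), score)   # local evidence, SDM
    return combine_evidence(r_g, r_l, alpha)                         # Eq. (1)
```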

2.5 Relevance score function

The remaining issue is the score function to evaluate the relevance between a document and a query. Since LDM views a blog feed as a large document, we need a score function to estimate the relevance between a large document (blog feed) and a query. For SDM, we also need a score function to evaluate the relevance between each blog post and a query. To this end, we use one of the representative state-of-the-art retrieval models, the KL-divergence language model (Lafferty and Zhai 2001).

Let θ_Q and θ_D be a query language model and a document language model, respectively. We use Dirichlet smoothing (Zhai and Lafferty 2004) to estimate the document language model. Our score function is as follows:

$$ Score(Q,D) \buildrel{def} \over {\mathop{=}} \sum_{w \in Q \cap D} P(w|\theta_{Q}) \cdot \log \left( 1 + \frac{tf(w,D)}{\mu P(w|C)} \right) + \log \frac{\mu}{\mu + |D|} $$
(4)

where tf(w, D) is the frequency of term w within a document D, \(P(w|C)=\frac{ctf_w}{|C|}\) where \(ctf_w\) is the number of times term w occurs in the entire collection C, and μ is a smoothing parameter.
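A self-contained sketch of the score function (4) is given below; the collection statistics are passed in as a term-frequency map and the total collection length, and all names and the default μ are illustrative assumptions.

```python
import math
from collections import Counter

# Sketch of Eq. (4) with Dirichlet smoothing. `collection_tf[w]` is ctf_w and
# `collection_len` is |C|; queries and documents are lists of terms.
def score(query, document, collection_tf, collection_len, mu=1000.0):
    q_tf = Counter(query)
    d_tf = Counter(document)
    q_len = float(len(query))
    s = 0.0
    for w, qf in q_tf.items():
        if w not in d_tf or collection_tf.get(w, 0) == 0:
            continue                       # the sum runs over w in Q ∩ D only
        p_w_q = qf / q_len                 # maximum-likelihood query model
        p_w_c = collection_tf[w] / collection_len
        s += p_w_q * math.log(1.0 + d_tf[w] / (mu * p_w_c))
    return s + math.log(mu / (mu + len(document)))   # length-normalization term
```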

In the initial retrieval, a query language model is estimated by using the maximum likelihood estimate. We then update the query language model based on feedback documents. In Sect. 4, we address novel feedback approaches to improve the performance of the blog feed search.

3 Retrieval experiments

We investigated the influence of global and local evidence on the performance of blog feed search by varying the weight parameter α. The experimental results show that our models based on passage retrieval are simple, yet effective for blog feed search.

3.1 Experimental setup

3.1.1 Data set

The TREC Blogs06 and Blogs08 collections (Macdonald 2006; Macdonald et al. 2010) were used for our experiments. Each collection is a large sample of the blogosphere. Table 1 shows the statistics of the collections. For the TREC 2009 Blog Distillation task, we evaluated the topical relevance of a blog feed using only the 39 topics that have at least one relevant blog (Macdonald et al. 2010).

Table 1 Statistics for the test collections

We only used permalinks (blog posts) for the experiments. We discarded the HTML tags of the blog posts. The posts were also processed by stemming using the Porter stemmer and eliminating stopwords using the INQUERY words stoplist (Allan et al. 2001).

3.1.2 Parameter setting and evaluation measures

We evaluated four basic models, GLD, GSD, LLD and LSD, using only the title field of each topic as a query. We also evaluated four combined models, GLD+LLD, GLD+LSD, GSD+LLD and GSD+LSD. Each model has a few parameters. The global models (GLD and GSD) have one parameter, the Dirichlet smoothing parameter μ. The local models (LLD and LSD) have two parameters, the smoothing parameter μ and T, which controls the number of posts used to estimate the local evidence of a feed. In addition to these parameters, the combined models have the weight parameter α.

We trained the parameters on the 07 topics to evaluate performance on the 08 topics, and vice versa. The parameters for the 09 topics were trained using the 07 and 08 topics. We selected the parameters that yielded the best MAP score; a sketch of this tuning procedure follows.
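The tuning amounts to a grid search on the training topic set; a minimal sketch is given below, assuming an `evaluate_map(params, topics)` routine that runs retrieval and returns MAP, with example parameter grids that are not the ones used in the paper.

```python
from itertools import product

# Illustrative cross-topic-set tuning: pick the parameters maximizing MAP on
# the training topics (e.g. the 07 topics), then apply them to the test topics.
def tune(train_topics, evaluate_map,
         mu_grid=(500, 1000, 2000),
         T_grid=(1, 2, 5),
         alpha_grid=tuple(a / 10 for a in range(11))):
    return max(product(mu_grid, T_grid, alpha_grid),
               key=lambda params: evaluate_map(params, train_topics))
```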

Similar to the Blog Distillation task, we retrieved the 100 most relevant blog feeds in response to each query. We used mean average precision (MAP) and precision at rank 10 (Pr@10) as the evaluation measures.

3.2 Results and discussion

Table 2 shows the performance of each model. We performed the Wilcoxon signed rank test to examine whether the improvement of the combined models over the baseline (GLD) was statistically significant. The baseline outperformed the other basic models for all the topic sets. These results are similar to those of previous work (Macdonald et al. 2008; Ounis et al. 2009).

Table 2 The performance of basic models and combined models

The best performance was obtained from GSD+LSD. Compared with the basic models, all the combined models improved the performance significantly and consistently. This confirms our hypothesis that the combined approach reduces the risk of using each type of evidence separately, and leads to better performance. An interesting observation is that the best performance among the basic models resulted from using the Large Document Model (GLD), but the best performance among the combined models resulted from using the Small Document Model (GSD+LSD). This implies that the interaction of the global and local evidence for blog feed search is better captured by SDM than by LDM.

Figure 1a shows how varying the weight parameter α affects the performance of blog feed search when using GSD+LSD, the best-performing model. The weight parameter α controls the relative importance of the global and local evidence. For all the topic sets, we obtained the best performance when the weight parameter was 0.7 or 0.8. This confirms that the two types of evidence should be considered together to improve the performance of blog feed search.

Fig. 1 MAP scores for varying the parameters α and T under the GSD+LSD retrieval model: (a) MAP scores for varying the weight parameter α; (b) MAP scores for varying the value of T

Figure 1b shows the influence of the parameter T on the performance of blog feed search when using GSD+LSD. T controls how many blog posts are used for the local evidence of a blog feed in response to a given query. We can obtain the best performance when T is set to 2. This reveals that using a few highly relevant posts within a blog feed is effective in evaluating the local evidence.

3.3 Comparison with other approaches

In the experiments, we showed that the use of local evidence is quite helpful in improving the performance of blog feed search. Some previous researchers had already utilized similar methods.

Macdonald and Ounis (2008), motivated by the Voting Model for the expert search task, suggested expCombSUM. In expCombSUM, the highly relevant posts have a large effect on the relevance score of a blog feed. Due to its weighted approach using query-relevant scores, expCombSUM plays a similar role to local evidence. Elsas et al. (2008) also proposed the Entry Centrality Component as a part of the Small Document Model. The component estimates a probability distribution to measure the similarity between a blog post and its feed, and controls the weight of each post to evaluate the relevance between its feed and a query.

However, these approaches are different from ours in some respects. They consider all constituent posts within a blog feed. Although the posts are differently weighted, the approaches can be regarded as using weighted global evidence. In contrast, our model actively finds local evidence corresponding to the passage-level evidence for passage retrieval. Furthermore, whereas their approaches can only be applied in SDM, our model provides a more flexible and expanded framework, in the sense that two types of evidence can be estimated regardless of representation methods (e.g. LDM or SDM).

One of the most similar approaches to our model is the PCS-GR model suggested by Seo and Croft (2008). PCS-GR is an approach combining their Global Representation and Pseudo-Cluster based Selection, corresponding to our GLD+LSD approach. Like our results, they showed that the combining approach results in significant improvements in their well-designed experiments. However, our motivation is different from theirs. Whereas they introduced a combining approach to penalize topically-diverse feeds, we proposed a combining approach to avoid “over-penalizing” topically-diverse feeds. The local evidence of a blog feed plays a similar role to the passage-level evidence of passage retrieval. In addition, our approach provides a general framework by integrating global and local evidence, including PCS-GR as a special case (i.e. GLD+LSD).

4 Feedback model for blog feed retrieval

In the previous section, we showed how local evidence is explored for the initial retrieval of blog feed search, and verified that local evidence is helpful in improving retrieval performance. In this section, we further explore local evidence in terms of PRF, and propose novel feedback approaches based on local evidence.

4.1 Limitations of naive feedback approaches

Before addressing our feedback methods, we present two naive approaches for PRF and show why they are not desirable.

Because the retrieval unit of blog feed search is a blog feed, not a document, a blog feed is also a natural feedback unit. In this regard, a naive feedback model is an All-Posts approach, which chooses all constituent posts of the top-ranked feeds as feedback documents. However, due to the topical diversity of blog feeds, even if a blog feed is relevant, this does not mean that all of its constituent posts are relevant to a query. Furthermore, if some of the top-ranked feeds chosen for feedback are irrelevant, almost all of the posts within them could be irrelevant. Therefore, the All-Posts approach carries a potentially high risk of selecting many irrelevant posts, which decreases the precision of the feedback information.

Another naive model is a Post-Level approach, which applies the traditional feedback approach to blog feed search. The approach first performs a post-level retrieval and then uses the top-ranked posts as feedback documents, without considering which feed they come from. Unlike the All-Posts approach, the Post-Level one does not suffer from the low precision of feedback information. However, the feedback information can be biased toward a dominant aspect within the top-ranked posts. In other words, the Post-Level approach may suffer from “aspect recall” (Kurland et al. 2005), one of the important properties which determines feedback quality.

With regard to query expansion for blog feed search, previous work has addressed some properties of blog feed search queries: “\(\ldots\) Given the nature of feed search, queries may describe more general and multifaceted topics, likely to stimulate discussion over time. If a query corresponds to a high-level description of some topic, there might be a wide vocabulary gap between the query and the more nuanced and faceted discussion in blog posts” (Elsas et al. 2008).

This property can make the aspect-recall problem of the Post-Level approach more serious, because the vocabulary gap may make the top N ranked documents more likely to be biased to a certain aspect of a given query. As a result, the feedback documents selected using the Post-Level approach will cover only a few aspects of a query.

4.2 Feed based selection

A blog feed consists of posts with diverse topics depending on the bloggers’ interests or inclinations. Thus, for a given query, the blog posts from different feeds may present different perspectives or facets of a topic, although they address information about the same topic. In other words, all (unknown) aspects of a query are scattered over all the relevant feeds, and their relevant posts. Therefore, if we gather information from various blog feeds, we can obtain more diverse information about a query so that it can cover the various aspects of the query topic, and this leads to the improved performance of PRF.

However, this approach can have the same problem as the All-Posts approach. To solve this problem, motivated by passage-based feedback, we propose Feed-Based Selection which first selects as many feeds as possible for PRF, and then gathers only a few posts within each of them, in order of the relevance between posts and the query. In other words, Feed-Based Selection uses local evidence on the top-ranked blog feeds. This method corresponds to passage-based feedback in ad-hoc retrieval where the scope of the feedback context is narrowed into the passage, rather than using the entire document context.

Feed-Based Selection has two important characteristics that allow it to handle the problems of the two naive approaches, All-Posts and Post-Level. First, it only uses the highly relevant posts of a top-ranked feed (local evidence), not all of its posts (global evidence). In contrast to the All-Posts approach, it can thus alleviate the low-precision problem caused by the topical diversity of a blog feed. Second, it collects more diverse information from as many feeds as possible. As a result, it allows the system to learn much more about the aspects of a query than the Post-Level approach, leading to an increase in aspect recall.

Similar to the initial retrieval model presented in Sect. 2, one of the most important issues is how to define the local evidence of each blog feed. We propose two approaches for defining local evidence: Fixed Feed Based Selection and Weighted Feed Based Selection.

4.3 Fixed feed based selection (FFBS)

FFBS uses the top K ranked feeds to gather feedback documents, and treats these feeds as equally relevant to a given query, regardless of their relevance scores.

Let FB_{FFBS} be the set of blog posts chosen using FFBS. We define FB_{FFBS} as follows:

$$ FB_{FFBS} = \left\{d|d_{m,k} \in F_{k}; k=1\cdots K, m=1\cdots M \right\} $$
(5)

where d_{m,k} indicates the mth ranked blog post (ordered by the score obtained from (4)) within the kth ranked feed, and F_k represents the kth ranked feed. In this paper, FFBS-K-M denotes FB_{FFBS} with parameters K and M.
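A minimal sketch of FFBS follows; `ranked_feeds` is assumed to be the feed ranking produced by the initial retrieval, each feed a list of posts, and `score` again denotes (4).

```python
# Sketch of FFBS (Eq. 5): the top M posts from each of the top K feeds.
def ffbs(query, ranked_feeds, score, K=3, M=3):
    feedback_docs = []
    for feed in ranked_feeds[:K]:
        top_posts = sorted(feed, key=lambda post: score(query, post), reverse=True)
        feedback_docs.extend(top_posts[:M])   # local evidence of the k-th feed
    return feedback_docs
```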

4.4 Weighted feed based selection (WFBS)

Similar to FFBS, WFBS also uses the top K ranked feeds to construct the set of feedback documents. However, WFBS chooses a different number of blog posts from each blog feed according to its relevance score. To achieve this, we assign different weights to the top K feeds in order of their relevance.

Let N be the total number of feedback documents and FB_{WFBS} be the set of blog posts chosen using WFBS. We define FB_{WFBS} as follows:

$$ FB_{WFBS}=\left\{d|d_{m,k}\in F_{k}; k=1\cdots K, m=1\cdots M_{k} \right\} $$
(6)

where M_k indicates the number of blog posts selected from the kth ranked feed, defined as follows:

$$ M_{k}=\frac{WF_{k}}{\sum_{j}WF_{j}}\times N $$
(7)

where WF_i indicates the weight of the ith ranked blog feed. In practice, M_k must be an integer, so it is rounded to the nearest integer. WFBS-K-N denotes FB_{WFBS} with parameters K and N.

There may be several methods to assign the weight WF_i, but this paper uses a simple method defined as follows:

$$ WF_{i}=K-i+1 $$
(8)

where WF_i decreases with i: the highest-ranked blog feed has a weight of K and the Kth ranked feed has a weight of 1.
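Under these definitions, WFBS can be sketched as follows; the rounding of M_k to the nearest integer follows the text, and all names are illustrative.

```python
# Sketch of WFBS (Eqs. 6-8): feed i gets weight WF_i = K - i + 1 and
# contributes M_i = round(WF_i / sum_j WF_j * N) of its top-ranked posts.
def wfbs(query, ranked_feeds, score, K=3, N=10):
    weights = [K - i + 1 for i in range(1, K + 1)]      # WF_1 = K, ..., WF_K = 1
    total = float(sum(weights))
    feedback_docs = []
    for feed, wf in zip(ranked_feeds[:K], weights):
        m_k = int(round(wf / total * N))
        top_posts = sorted(feed, key=lambda post: score(query, post), reverse=True)
        feedback_docs.extend(top_posts[:m_k])
    return feedback_docs
```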

5 Feedback experiments

In this section, we investigate the influence of several document selection approaches on the performance of PRF.

5.1 Experiment setup

For the feedback experiments, we used GSD+LSD as the baseline retrieval model, because it showed the best performance among the initial feed retrieval models in Sect. 3. The baseline model, GSD+LSD, is also used to perform PRF based on the expanded query model.

To update the query language model, we used model-based feedback (Zhai and Lafferty 2001).

$$ \theta_{Q^{'}}=(1-\alpha_{F})\theta_{Q}+\alpha_{F} \theta_{F} $$
(9)

where α_F controls the influence of the feedback model, and the feedback model θ_F is estimated using a generative model of the feedback documents.
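A sketch of the query-model update (9) is shown below. Estimating θ_F with the generative mixture model of Zhai and Lafferty (2001) requires EM; purely for illustration, the sketch approximates θ_F with a maximum-likelihood model over the feedback posts.

```python
from collections import Counter

# Sketch of Eq. (9); theta_q and the returned model are dicts mapping terms to
# probabilities, and feedback_docs are lists of terms (e.g. from FFBS or WFBS).
def expand_query_model(theta_q, feedback_docs, alpha_f=0.5):
    counts = Counter(term for doc in feedback_docs for term in doc)
    total = float(sum(counts.values()))
    theta_f = {w: c / total for w, c in counts.items()}   # simplified feedback model
    vocab = set(theta_q) | set(theta_f)
    return {w: (1.0 - alpha_f) * theta_q.get(w, 0.0) + alpha_f * theta_f.get(w, 0.0)
            for w in vocab}
```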

5.1.1 Document selection approaches

We built several sets of feedback documents. Each set includes 10 documents as feedback documents. The document sets used for feedback are as follows:

  • TOP-10: 10 documents are chosen according to the relevance of the document. This approach is the Post-Level document selection.

  • Feed3All-Posts: All posts from the top 3 ranked feeds are chosen as feedback documents.

  • Feed5All-Posts: All posts from the top 5 ranked feeds are chosen as feedback documents.

  • FFBS-3-3: 10 documents are chosen using FFBS with K = 3 and M = 3.

  • FFBS-5-2: 10 documents are chosen using FFBS with K = 5 and M = 2.

  • WFBS-3-10: 10 documents are chosen using WFBS with K = 3 and N = 10.

  • WFBS-5-10: 10 documents are chosen using WFBS with K = 5 and N = 10.

5.2 Experimental results

Table 3 shows the performance of each selection method. The experimental results show that our feed-based selection approaches (FFBS and WFBS) significantly and consistently outperform the baselines. FFBS and WFBS increase the MAP score by 2–3% over the baseline models for all the topic sets. To check whether our methods show statistically significant improvements over the baseline, we performed the Wilcoxon signed rank test at the 0.05 significance level for each metric, and attached the symbol ‡ to the scores of FFBS and WFBS only when they showed significant improvements over the baseline. As shown in Table 3, almost all runs of FFBS and WFBS showed statistically significant improvements over the baseline on MAP. This means that the feed-based approaches (FFBS and WFBS) are effective in improving the performance of PRF.

Table 3 The performance of the feedback models according to each feedback document selection approach

Furthermore, FFBS and WFBS show better performance than the two naive approaches: All-Posts (Feed3All-Posts and Feed5All-Posts) and Post-Level (TOP-10). To see whether the improvement is statistically significant, we again performed the Wilcoxon signed rank test, and attached § and ¶ only when they showed significant improvements over All-Posts and Post-Level, respectively. We found that the majority of FFBS and WFBS runs show statistically significant improvements over both naive approaches.

The All-Posts and Post-Level methods did not show reliable performance. They did not show any improvement over the baseline for most topic sets. First, the failure of the All-Posts approach provides good evidence that it suffers from low precision of feedback information. In particular, for the 08 topics, the top K feeds used for PRF are likely to contain many irrelevant feeds, because the initial performance for the 08 topics is relatively low. Thus, as K increases for the 08 topics, the feedback documents constructed using the All-Posts include too many irrelevant documents to improve the performance of PRF. Actually, when using K = 5, the performance deteriorated more seriously than when K = 3. This result explains why we need to use local evidence for PRF.

Second, for the 07 and 08 topics, the failure of the Post-Level approach supports our proposal for the feed-level selection. Post-Level suffers from low aspect recall so that it can only cover a few relevant aspects of a query. In contrast, our approaches enable the system to increase the aspect recall, because the feedback documents are chosen from various feeds which reflect the diverse aspects relevant to a query. Finally, this leads to the improved performance of the feedback model.

We compare our approaches with the top 3 performing runs of the TREC 07, 08 and 09 Blog Distillation tasks in Table 4. The results are taken from (Macdonald et al. 2008; Ounis et al. 2009; Macdonald et al. 2010). Our feedback approaches substantially and consistently improve upon the results of the best runs for all tasks. In particular, for the 07 task, WFBS-5-10 achieved about a 6% increase in MAP over the TREC ’07 best run. FFBS-3-3 achieved more than a 2% increase in MAP over the TREC ’08 best run. WFBS-3-10 also increased the MAP score by 12% over the TREC ’09 best run.

Table 4 The performance of the top 3 performing runs for the TREC 07, 08 and 09 Distillation tasks

Note that in Table 4, we only quote the official results of the top performing runs from TREC; we did not re-implement them. Furthermore, we did not apply a significance test such as the Wilcoxon signed rank test between our methods and the TREC runs. Therefore, it is unclear what caused the differences in performance between our methods and those runs. The differences might be caused by several factors, such as the method for preprocessing documents, the way of selecting parameters, or the effectiveness of each algorithm for blog feed search. It would be valuable to implement the top performing algorithms and directly compare the results; we leave this for future work.

5.3 Influence of K and M on performance

Figure 2 shows the performance of FFBS and All-Posts when varying the K and M parameters.

Fig. 2 The MAP scores of FFBS for varying K and M parameters, using GSD+LSD as the retrieval model, compared to the baseline and All-Posts: (a) MAP scores for the 07 topics; (b) MAP scores for the 08 topics; (c) MAP scores for the 09 topics

The FFBS methods show more reliable and better curves than All-Posts for all the topic sets. In particular, for the 08 and 09 topics, the performance gap between the FFBS methods and All-Posts was very large at large values of K. From these results, we can again verify the effectiveness of local evidence in improving the performance of PRF.

The best range of K differed across the topic sets. For the 07 topics, the best MAP scores were found at relatively large K values between 5 and 7, while the MAP scores at small K (≤ 2) were poor. For the 08 and 09 topics, however, the trend is reversed: the best MAP scores are obtained at relatively small K values between 1 and 3, while the MAP scores at large K (≥ 5) decrease sharply.

Note that the performance curves are more robust on the 07 topics than the 08 and 09 topics for all methods including All-Posts. In other words, on the 07 topics, even at large K values, the MAP for each method did not seriously decrease, while on the others, when K ≥ 5, the MAP of all methods decreased sharply.

One possible explanation for the differing trend and robustness between the 07 topics and the 08 and 09 topics can be obtained by comparing the performance of the initial retrieval for each topic set. From Table 3, we already saw that Pr@10 on the 07 topics is much better than on the 08 and 09 topics. That is, there will be more relevant feeds among the top-ranked ones for the 07 topics than for the 08 and 09 topics. As a result, the loss of precision from using more feeds is not severe, resulting in reliable MAP scores. In contrast, for the 08 and 09 topics, when using a relatively large K value (about 5), the top K ranked feeds are likely to be irrelevant due to low Pr@10, so the precision decreases seriously, causing low MAP scores.

When using M = 2, we obtained the most reliable performance among the three values tested, for all the topic sets. The results for M = 1 and M = 5 vary with the topic set. For example, when M = 5, the MAP for the 07 topics is the best of the three values, whereas for the 08 topics it is worse than when M = 1 or M = 2.

5.4 Comparison with the cluster centroid algorithm

As discussed in Sect. 4, our feed-based selection is motivated by aspect recall. Existing work on ad-hoc retrieval also aims to increase aspect recall: the Cluster Centroid approach (Shen and Zhai 2005). Cluster Centroid clusters the feedback documents to maximize the diversity of the feedback information (i.e. aspect recall). It consists of three steps: (1) group the top N documents into K clusters, (2) select a centroid document from each resulting cluster, and (3) use the K centroid documents as feedback documents. Since Cluster Centroid does not use any information about the relationship between posts and their feed, it can be viewed as an automatic method that constructs feeds by regarding each cluster as a pseudo feed. We re-implemented the Cluster Centroid method in the same setting used in their experiments, using the K-Medoid clustering algorithm (Kaufman and Rousseeuw 1990) and J-Divergence (Lin 1991) as the distance function between clusters. For a fair comparison with the previous section, we fix the number of clusters K to 10.
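A rough sketch of this re-implementation is given below; the smoothing constant, initialization and iteration cap are our own illustrative choices rather than details taken from Shen and Zhai (2005).

```python
import math
import random

# K-medoid clustering of the top-N feedback posts, with J-divergence
# (symmetric KL) between smoothed unigram post models as the distance.
def j_divergence(p, q, eps=1e-10):
    vocab = set(p) | set(q)
    d = 0.0
    for w in vocab:
        pw, qw = p.get(w, eps), q.get(w, eps)
        d += pw * math.log(pw / qw) + qw * math.log(qw / pw)
    return d

def cluster_centroids(post_models, K=10, iters=20, seed=0):
    """Return the indices of K medoid posts to use as feedback documents."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(post_models)), K)
    for _ in range(iters):
        # assign each post to its closest medoid
        clusters = {m: [] for m in medoids}
        for i, model in enumerate(post_models):
            nearest = min(medoids, key=lambda m: j_divergence(model, post_models[m]))
            clusters[nearest].append(i)
        # re-select each medoid as the member minimizing total in-cluster distance
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)
                continue
            new_medoids.append(min(members, key=lambda i: sum(
                j_divergence(post_models[i], post_models[j]) for j in members)))
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids
```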

Table 5 shows the results of Cluster Centroid, according to the number of top posts N for clustering, where the feedback model is GSD+LSD. Note that when N = 10, the Cluster Centroid corresponds to TOP-N, since each post creates a separate cluster. The Cluster Centroid method outperforms the baseline for some N values by about 0.2%, 0.9% and 2.6% for the 07, 08 and 09 topics, respectively. This result of Cluster Centroid is important, because it confirms the view we previously discussed on the aspect recall, i.e., using diverse information is helpful to improve the performance of PRF for blog feed search.

Table 5 The performance of K = 10 Cluster Centroid with N under the GSD+LSD method

Our approaches are still notable, due to the improvements over Cluster Centroid. In particular, for the 07 and 08 topics, our best approaches show about 3.4% and 1.5% increases of MAP over Cluster Centroid, respectively. From these results, we can verify that the feed-level information used in our methods is important for improving the retrieval performance, because it captures a realistic structure between posts and feeds that Cluster Centroid cannot automatically recognize.

6 Related work

Since the TREC blog distillation task was introduced, many approaches have been suggested for blog feed search. Most approaches are motivated by other well-studied retrieval tasks such as the expert search task (Soboroff and de Vries 2007) and the resource selection task in distributed information retrieval.

Elsas et al. (2008) and Arguello et al. (2008), (2009) treated blog feed search as a resource ranking problem by using the ReDDE federated search algorithm (Si and Callan 2003). They proposed two blog representations based on granularity, and also suggested a query expansion approach using Wikipedia for blog feed search. For PRF, Elsas et al. (2008) proposed knowledge-intensive feedback, using Wikipedia as external knowledge. Despite its notable results, their approach is not a closed solution that only uses the given test collection, which is different from our approaches.

Seo and Croft (2008), (2009) dealt with blog feed search by using cluster-based retrieval for distributed information retrieval. They also divided blog sites into three types based on topical diversity, and considered several methods for penalizing blog sites with diverse topics.

Macdonald and Ounis (2008) and He et al. (2009) regarded blog feed search as an expert finding task. They used the adaptable Voting Model for the expert search task (Macdonald and Ounis 2006), and proposed several techniques that aim to boost blog feeds where a blogger has shown a central or recurring interest in a topic area. Carman et al. (2009) also used a similar approach, using the Voting Model. In contrast to Macdonald and Ounis’s work, they used non-content features for each blog in addition to existing content-level features, and applied the Learning-to-Rank (Yue et al. 2007) approach to combine the features and obtain a single retrieval function.

Nunes et al. (2009) suggested several strategies using temporal features for blog feed search. They examined whether or not the maximum temporal span covered by the relevant posts is a positive criterion in the feed search, and also investigated how the dispersion of relevant blog posts in a blog feed would impact this task. Wang et al. (2009) proposed a reduced document model by indexing text between certain tags, and used the PageRank of a blog feed with its query likelihood score. Balog et al. (2008) and Weerkamp et al. (2008) proposed two language models based on expert finding techniques, and some blog-specific features such as document structure, social structure, and temporal structure.

7 Conclusion and future work

In this paper, we have addressed several approaches for initial retrieval and pseudo-relevance feedback on blog feed search. Our key concern was the topical diversity of a blog feed. Motivated by passage retrieval techniques, we presented global and local evidence of blog feeds, corresponding to the document-level and passage-level evidence of passage retrieval. We estimated global evidence using all constituent posts within a blog feed, and local evidence using highly relevant posts within a blog feed in response to a given query. We proposed a series of methods for evaluating the relevance between a blog feed and a given query, using the two types of evidence.

In addition, we investigated the pseudo-relevance feedback method for blog feed search. Our feedback approaches, motivated by passage-based feedback, gathered feedback information using the local evidence of top K ranked feeds. The proposed methods have two advantages. First, the usage of various feeds enables the feedback model to locate the feeds that discuss different aspects of the topic of a given query. In other words, it increases the aspect recall of feedback information. Second, the usage of the local evidence provides the feedback model with information relevant to a query. That is, it increases the precision of feedback information. Experimental results on TREC distillation for the 07, 08 and 09 topics showed that the proposed feedback approach significantly and consistently outperformed the baseline.

Several directions remain for future work. First, for the initial retrieval, we used a simple uniform distribution for P(D|L) in (3). It would be interesting to investigate other methods to estimate P(D|L), such as Entry Centrality (Elsas et al. 2008). Furthermore, we would like to investigate other techniques for blog feed search, such as link analysis and temporal profiling, which have the potential to improve performance. Second, for pseudo-relevance feedback, we will explore a probabilistic approach for selecting the relevant local posts, instead of the current threshold-driven method.