1 Introduction

The experiments reported in this paper were conducted as part of a workshop on Reliable Information Access (RIA) funded by the intelligence community through its Advanced Research and Development Activity (ARDA) (Harman and Buckley 2009). These experiments directly compare the document and term selection methods used for blind feedback by the participating information retrieval systems: CLARIT, CLJ, HITIQA, Lemur, MultiText, Okapi, and SMART.

In all cases, principals involved in the development or deployment of these systems cooperated in the experiments and provided technical advice. Through these experiments we hoped to determine whether benefits could be gained through an exchange of information between systems at intermediate stages of the retrieval process. The work reported here differs from TREC in the level of control that could be exerted over the participating systems. Terms and documents from one system could be transferred to another, with experts guiding the process. Stopword lists, stemming algorithms, and system parameters could be adjusted on demand.

This paper reports the details of three RIA experiments. The first experiment (Sect. 2) focuses on the exchange of feedback documents, the second (Sect. 3) on term selection, and the third (Sect. 4) on the fusion of intermediate results from multiple systems.

1.1 Blind feedback

Blind feedback (or “pseudo-relevance” feedback) incorporates two retrieval stages. In the first stage, the IR system formulates a set of query terms from the user’s initial information request (or “topic”) and retrieves the top k documents. In the second stage, the system automatically selects additional query terms from this initial document set and adds them to the original query. Characteristics of the retrieved documents may be used to adjust weights for both the new and old query terms. The system then evaluates this augmented query to generate the final result set.
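
As a concrete illustration, the following is a minimal, self-contained sketch of this two-stage process over a toy in-memory corpus, using plain term-frequency scoring. The corpus, the scoring function, and the parameter values are illustrative assumptions only; each participating system used its own, far more sophisticated, retrieval and weighting methods.

    # Toy sketch of the two-stage blind feedback loop described above.
    from collections import Counter

    corpus = {
        "d1": "ocean remote sensing satellite radar image",
        "d2": "spaceborne radar satellite ocean surface data",
        "d3": "basketball season playoff scores",
    }

    def score(query_terms, text):
        tf = Counter(text.split())
        return sum(tf[t] for t in query_terms)      # simple term-frequency score

    def search(query_terms, k):
        ranked = sorted(corpus, key=lambda d: score(query_terms, corpus[d]),
                        reverse=True)
        return ranked[:k]

    def blind_feedback(topic, k=2, n_terms=2):
        query = topic.lower().split()               # first stage: initial query
        top_k = search(query, k)                    # initial retrieval
        pool = Counter(w for d in top_k             # second stage: select new
                       for w in corpus[d].split()   # terms from the top k docs
                       if w not in query)
        expansion = [t for t, _ in pool.most_common(n_terms)]
        return search(query + expansion, 1000)      # final retrieval

    print(blind_feedback("ocean remote sensing"))   # ['d1', 'd2', 'd3']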

Figure 1 illustrates the blind feedback process. The diagram is divided into two sections, one for each feedback stage. The first section retrieves the initial documents or sub-documents; the second section analyzes these documents or sub-documents to find additional terms. Each section is further separated into components. All systems participating in the RIA workshop can be mapped onto this general approach. In most systems the IR techniques used for the initial and final retrieval were identical, but several systems used a passage (or “sub-document”) retrieval technique in the first stage and a document retrieval technique in the second.

Fig. 1 Blind feedback

1.2 Evaluation framework

The experiments utilize the description fields of 150 topics created for the TREC 6, 7 and 8 ad hoc retrieval tasks (Voorhees and Harman 1997; Voorhees 1998; Voorhees and Harman 1999). Each description field briefly describes documents relevant to the topic, often in a single sentence. For example, the description for topic 355 from TREC-7 is:

  • Identify the development and application of spaceborne ocean remote sensing.

All participating systems formulated initial queries from these descriptions by eliminating stopwords and stemming the remaining words. Some systems identified phrases appearing in these topics and added these phrases to the query as additional terms, and some systems used term frequency within the description as a factor when computing term weights.
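
As a rough illustration of this formulation step, the sketch below removes stopwords and applies a crude suffix-stripping stemmer to a topic description. The stopword list and stemmer here are toy stand-ins, not the ones used by any participating system.

    # Toy sketch of forming an initial query from a topic description.
    from collections import Counter

    STOPWORDS = {"identify", "the", "and", "of", "a", "an", "in", "to"}

    def crude_stem(word):                     # stand-in for a real stemmer
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def make_query(description):
        words = [w.strip(".,").lower() for w in description.split()]
        terms = [crude_stem(w) for w in words if w not in STOPWORDS]
        return Counter(terms)                 # term frequency can feed weights

    print(make_query("Identify the development and application of "
                     "spaceborne ocean remote sensing."))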

Queries were executed over a target corpus consisting of disks 4 and 5 of the TREC collection minus the Congressional Record documents. This corpus consists of 528,155 documents taken from the Financial Times, the Federal Register, the Foreign Broadcast Information Service and the LA Times. Relevance judgments for these documents are available for all 150 topics.

Following the conventions established at TREC, each experimental run consists of the top 1,000 documents for each of the topics. Mean average precision (MAP) is our primary evaluation measure; the MAP values reported in this paper were computed by a variant of the trec_eval program. Further information on the TREC test collection is available from the TREC website.
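
For reference, the following sketch shows how uninterpolated average precision and MAP are typically computed from a ranked run and a set of relevance judgments. It mirrors the trec_eval definition but is a simplified stand-in, not the evaluation code actually used.

    # Average precision for one topic and MAP across topics.
    def average_precision(ranked_docs, relevant):
        hits, precision_sum = 0, 0.0
        for rank, doc in enumerate(ranked_docs, start=1):
            if doc in relevant:
                hits += 1
                precision_sum += hits / rank      # precision at this rank
        return precision_sum / len(relevant) if relevant else 0.0

    def mean_average_precision(runs, qrels):
        # runs: topic -> ranked doc ids; qrels: topic -> set of relevant ids
        return sum(average_precision(runs[t], qrels[t]) for t in qrels) / len(qrels)

    print(average_precision(["d1", "d2", "d3"], {"d1", "d3"}))   # 0.8333...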

1.3 Participating systems

The following systems and groups participated in the experiments. Additional details regarding the participating groups and their approach to the blind feedback problem may be found in the RIA overview report elsewhere in this volume (Harman and Buckley 2009). Note that both CMU and the University of Massachusetts employed a version of the Lemur system, and that HITIQA incorporates an older version of the SMART system.

  • CLARIT and CLJ, Clairvoyant Corp. (Evans and Lefferts 1994, 1995)

  • HITIQA, SUNY Albany (Small et al. 2004; Strzalkowski et al. 2004)

  • Lemur (UMASS), University of Massachusetts Amherst (Lemur)

  • Lemur (CMU), Carnegie Mellon University (Lemur)

  • MultiText, University of Waterloo (Yeung et al. 2003; Clarke et al. 2001)

  • Okapi, City University London (Robertson and Jones 1976; Robertson et al. 1995; Robertson 1990)

  • SMART, Sabir Research (Buckley 1985; Williamson et al. 1971)

1.4 Baseline system performance

At the start of the workshop each group generated a blind feedback run using the standard parameters and methods of their own system. The details of these runs are summarized in Table 1. The “# Docs” column lists the number of documents the system used for feedback. The “# Terms” column lists the number of terms added during the feedback process; original terms are not included in this count. A system using “Sub-Docs” may use less than an entire document for feedback.

Table 1 Baseline system performance

In the next two columns the table gives MAP values for two runs: a non-feedback run (“No BF”) and a blind feedback (“BF”) run. The “No BF” runs represent each system’s default behavior when blind feedback is disabled. These are not simply the documents returned by the first retrieval stage when blind feedback is enabled, since some systems use different tuning parameters when blind feedback is not used. These runs illustrate the impact of blind feedback, which is greater than 10% for all systems and more than 20% for some. Most differences between the non-feedback and blind feedback runs are significant (Wilcoxon signed-rank test, p < 0.01). None of the differences among the top four systems is significant.
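
A significance test of this kind can be run over the per-topic average precision values of the two runs. The sketch below assumes SciPy is available and uses illustrative numbers rather than the actual per-topic scores.

    # Paired Wilcoxon signed-rank test over per-topic average precision.
    from scipy.stats import wilcoxon

    ap_no_bf = [0.31, 0.12, 0.45, 0.08, 0.27]   # illustrative values only
    ap_bf    = [0.36, 0.15, 0.44, 0.11, 0.33]
    statistic, p_value = wilcoxon(ap_no_bf, ap_bf)
    print(p_value < 0.01)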

The last row of the table gives the results of fusing the runs using the well-established CombMNZ algorithm (Fox and Shaw 1994). It is well known that combining results from more than one system can produce higher performance than any single system alone (Croft 2000). CombMNZ is a simple but effective fusion algorithm, often used as a baseline in fusion experiments (Montague and Aslam 2002). The CombMNZ algorithm first normalizes the relevance scores from each system for each topic. A new score is computed for each document by summing the document’s normalized score from each system and then multiplying by the number of systems that retrieved the document.
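
A compact sketch of CombMNZ follows, assuming min-max normalization of scores within each topic (a common choice; the exact normalization used in the workshop is not detailed here). Each run is a mapping from document id to relevance score for a single topic.

    # CombMNZ fusion for a single topic.
    def normalize(run):
        lo, hi = min(run.values()), max(run.values())
        if hi == lo:
            return {doc: 1.0 for doc in run}
        return {doc: (s - lo) / (hi - lo) for doc, s in run.items()}

    def comb_mnz(runs):
        # runs: list of {doc_id: score} dicts, one per system
        normed = [normalize(run) for run in runs]
        fused = {}
        for doc in set().union(*(run.keys() for run in normed)):
            scores = [run[doc] for run in normed if doc in run]
            fused[doc] = sum(scores) * len(scores)   # sum times #systems retrieving
        return sorted(fused, key=fused.get, reverse=True)

    print(comb_mnz([{"d1": 2.0, "d2": 1.0}, {"d1": 0.8, "d3": 0.5}]))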

The RIA workshop provided an unusual opportunity to experiment with fusion at early stages of the blind feedback process. These experiments will be reported in a later section. The CombMNZ algorithm is used in all fusion experiments reported in this paper.

2 Swapping documents

Each of the eight systems that participated in the workshop can be divided into the two stages of our feedback model (Fig. 1). This experiment (the “swapdoc” experiment) investigated the effect of interchanging the two stages among the systems, as shown in Fig. 2. Our goal was to determine how much the initial retrieval strategy of each system affects the effectiveness of blind feedback.

Fig. 2 Swapping feedback documents

2.1 Swapdoc—experimental design

For swapdoc, the initial stage of each system was used to produce a fixed number of documents, which then became the input to the second stage of other systems. All groups prepared a list of their initial retrieved documents in the standard TREC result format. Then each group conducted blind feedback runs using each other’s list of initial retrieved documents as the source of expansion terms, but using their own methods and default parameters to choose and weight terms.
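
The exchanged lists followed the standard TREC result format, one line per retrieved document (“topic Q0 docid rank score run_tag”). The reader below is a minimal sketch for loading such a list, assuming the file is already sorted by rank within each topic.

    # Minimal reader for a TREC-format result file.
    from collections import defaultdict

    def read_trec_run(path, depth=60):
        runs = defaultdict(list)          # topic -> ranked list of doc ids
        with open(path) as f:
            for line in f:
                topic, _q0, docid, _rank, _score, _tag = line.split()
                if len(runs[topic]) < depth:
                    runs[topic].append(docid)
        return runs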

Some systems perform blind feedback using sub-documents instead of full documents. Fortunately, these systems used only a single sub-document from any given document; therefore, the documents containing these sub-documents could be provided to the other systems. Systems using sub-documents needed an additional step after receiving documents from another system: finding the best sub-document within each document provided.

Systems vary in the number of documents or sub-documents required for blind feedback; the maximum number of documents used by any system was 60. Therefore, each system was required to provide an ordered list of 60 documents per topic. Systems were permitted to choose their own best parameters for the number of documents and feedback terms, but were required to keep these values constant for all topics. Otherwise, documents provided by other systems were processed using each system’s normal term selection and weighting methods.

Overlap, defined as the percentage of documents in one run that are also found in the other, is an indicator of the similarity of the runs. As an example, Table 2 shows the document titles for the top initial documents from CMU and SMART for topic 355. The average overlap across topics for the 60 initial documents is detailed in Table 3. Average overlap ranges from 22 to 88%. For most pairings, the overlap is less than 50%. The UMass and CMU runs are based on different versions of the same retrieval system (Lemur), and this may be the cause of the particularly high overlap between these systems (88%).
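
The overlap numbers in Table 3 can be computed as in the sketch below, which takes two runs in the topic-to-ranked-list form used above and averages, over topics, the percentage of one system’s top 60 documents that also appear in the other’s top 60.

    # Average top-k overlap between two runs.
    def average_overlap(run_a, run_b, depth=60):
        percentages = []
        for topic in run_a:
            a = set(run_a[topic][:depth])
            b = set(run_b[topic][:depth])
            percentages.append(100.0 * len(a & b) / depth)
        return sum(percentages) / len(percentages)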

Table 2 Topic 355 top document titles
Table 3 Initial document overlap (top 60)

2.2 Swapdoc—results

Since systems are tuned to their own documents, we expected systems to achieve their best performance with their own documents and lower performance with documents from other systems. However, the experiment revealed that using another system’s documents can often increase performance, and that the choice of the producing system can have an enormous performance impact. All 64 combinations of producing and consuming systems are presented in Table 4.

Table 4 Document swapping—mean average precision

Reading down the column for MultiText or UMass, scores vary by at most 10% as the source of the top documents changes. For other systems, such as CLARIT, scores may vary by as much as 50%. These differences in sensitivity are somewhat surprising, given the uniform improvement from blind feedback itself. One explanation is that some groups may obtain their feedback improvement from the reweighting of the original query terms rather than from the addition of new terms.

Another surprising feature is how often systems prefer to consume documents from other systems rather than their own. For example, all systems do as well or better when consuming documents produced by SMART than they do with their own documents, even though the evaluation scores for SMART’s own top documents are, on average, lower than those of some other systems (though not significantly so). CMU prefers SMART’s documents by almost 20% over UMass’s documents, while other systems consider them about the same. Despite our efforts, we have no explanation for this observation. In general, though, perhaps the use of other systems’ documents makes retrieval more robust, for much the same reason that fusion of runs improves performance: the weaknesses of one system may be counterbalanced by the strengths of another.

Table 5 shows that some systems are good consumers and others are good producers. “Producing Average” is the average performance of all eight systems on the given system’s document set. “Consuming Average” is an average of the given system’s performance using all the document sets produced by the eight systems. HITIQA has an average increase of 21% using other systems’ documents while other systems have an average decrease of 12.7% when using HITIQA documents. This would indicate that HITIQA (which incorporates an older version of SMART) is a poor producer of documents. In contrast, consuming SMART documents gives an average increase of 6.8%, suggesting that SMART produces superior documents.

Table 5 Swapping document averages

The swapdoc experiment demonstrates that blind feedback improves performance regardless of which systems produce and consume the documents (Fig. 3). The graph shows the feedback improvement of each system when consuming documents produced by the other systems. HITIQA has the largest improvement in performance from feedback, but it was also the lowest-performing system. SMART, which produced the best initial documents, had the smallest improvement from feedback.

Fig. 3 Swapping documents feedback improvements

2.3 Swapdoc with fixed parameters

For the swapdoc experiment each system used its standard blind feedback settings, including number of documents and number of terms. To facilitate the term swapping experiment discussed in the next section, we re-ran the experiment with fixed parameters: 20 initial feedback documents and 5 terms. The results appear in Table 6. The performance of many systems is seriously harmed by fixing the parameters. Once again, SMART is the best producer of feedback documents.

Table 6 Document swapping with fixed parameters (20 documents, 5 terms)—mean average precision

3 Swapping terms

The focus of the second experiment (the “swapterm” experiment) was to determine the term selection component’s effect on performance for different systems. For this experiment, each system produced a set of terms using its blind feedback method and consumed terms generated by other systems. The process diagram for the swapterm experiment appears in Fig. 4.

Fig. 4 Swapping feedback documents and terms

3.1 Swapterm—experimental design

Swapping terms is far more complicated than swapping documents. Externally, documents may be represented by document identifiers, and a document swap requires only an exchange of these identifiers. On the other hand, systems represent terms in different ways, and there is no simple, common representation. Due to these differences, one system may produce terms that cannot be consumed by another.

Stemming is the most common reason why a system produces a term that another cannot consume. For example, some systems stem the word “antarctic” to “antarct-” while others do not stem it at all. This problem is compounded by the fact that many systems index only the stemmed version; the unstemmed version would therefore not appear in a stemmed index, and vice versa. To solve this problem, systems were required to produce the most common unstemmed version of each term; each consuming system then applied its normal stemming method.
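
One way a producing system can recover the most common unstemmed form is to keep, at indexing time, a count of the surface forms mapping to each stem. The sketch below illustrates the idea with a hypothetical stem() stand-in.

    # Emit the most common unstemmed surface form for a selected stem.
    from collections import Counter, defaultdict

    def stem(word):                        # hypothetical stand-in stemmer
        return word[:7] if word.startswith("antarct") else word

    surface_forms = defaultdict(Counter)   # stem -> counts of raw words seen
    for w in ["antarctic", "antarctica", "antarctic", "ocean"]:
        surface_forms[stem(w)][w] += 1

    def export_term(stemmed_term):
        return surface_forms[stemmed_term].most_common(1)[0][0]

    print(export_term("antarct"))          # -> antarctic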

Phrases also had to be considered, since not all systems could consume phrases. If a consuming system could handle a phrase, the phrase was kept; otherwise it was discarded. In the majority of cases, single-word terms could be consumed by all systems. In the few cases where a term could not be consumed by all systems, the term contained unusual features, such as numbers, or was a stopword in the consuming system.

There is surprisingly little overlap in the term sets. Term overlap percentages are given in Table 7. For the example topic 355, the following table shows the top 5 terms produced by CMU and SMART:

 

       CMU          SMART
  1    Research     Satellite
  2    Satellite    Radar
  3    Technology   Surface
  4    Basic        Image
  5    Data         Science

Table 7 Term overlap (top 5 terms)

Another problem we needed to overcome was that each system uses a different term weighting scheme. To solve this problem, each system used its own scheme to compute weights for consumed terms. To ensure that weights could be computed for these terms, the producing system also provided the documents from which the terms were taken. To simplify this process, we fixed the feedback parameters for the experiment at 20 documents and 5 terms.

Given documents and terms from a producing system, the consuming system stemmed the terms and generated weights by processing the documents provided. These weighted terms and the original query were combined into a new query, which was executed to retrieve the final result set.
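
How a consuming system derives weights from the provided documents is system specific. As an illustration only, the sketch below assigns each received term a weight proportional to its document frequency within the provided documents, after applying the consumer’s stemmer; this is not the weighting scheme of any particular participating system.

    # Illustrative (not system-specific) weighting of consumed feedback terms.
    def weight_terms(terms, documents, stemmer=lambda w: w, base_weight=0.5):
        # terms: unstemmed terms from the producer; documents: provided texts
        weights = {}
        for term in terms:
            s = stemmer(term.lower())
            df = sum(1 for doc in documents
                     if s in {stemmer(w) for w in doc.lower().split()})
            weights[s] = base_weight * df / len(documents)
        return weights

    print(weight_terms(["satellite", "radar"],
                       ["spaceborne radar image", "ocean satellite radar"]))
    # {'satellite': 0.25, 'radar': 0.5}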

3.2 Swapterm—results

The results of the swapterm experiments appear in Table 8. The baseline for this table is a system’s own performance when feedback parameters are fixed at 20 documents and 5 terms (Table 6).

Table 8 Swapping terms—mean average precision

In all but a few cases, consuming terms generated by another system harms performance. One exception is CLARIT consuming SMART’s terms, which gives a 24% improvement. The terms provided by CLARIT harm performance in most cases, but this effect is most likely due to the numerous phrases produced by CLARIT, which could not be consumed by other systems.

4 Swapping and fusing

Fusing results from multiple systems often produces better performance than that of the individual systems alone (Croft 2000). In a third experiment (the “swapfuse” experiment) we examined the impact of fusion on feedback, both by fusing initial document sets and by fusing the results from a single system when consuming different sets of initial documents. Having access to intermediate data allowed us to explore fusion over a wide range of system parameters.

We used three different fusion methods in our experiments. For initial result fusion experiments, a target system consumes an initial result set that is the fusion of the initial results from all eight participating systems. For fused common consumer experiments, we fused the results of a target system over eight different runs, each using the initial documents produced by a different system. For fused common producer experiments, we fused the results of eight different runs, each produced by a different system consuming the target system’s initial results. Figure 5 illustrates the differences between these experiments.

Fig. 5 Fusion experiments
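
The three configurations can be summarized structurally as in the sketch below. All functions here are stubs labeled as such: initial_run(), feedback_on(), and fuse() stand in for a system’s first stage, a system’s second stage, and a CombMNZ-style fusion (see the sketch in Sect. 1.4), and are filled with trivial bodies only so the sketch runs end to end.

    # Structural sketch of the three fusion configurations (stubs throughout).
    systems = ["A", "B", "C"]            # stand-ins for the eight systems

    def initial_run(producer):
        # stub: {doc_id: score} from `producer`'s first retrieval stage
        return {f"{producer}_doc{i}": 1.0 / (i + 1) for i in range(3)}

    def feedback_on(consumer, initial_docs):
        # stub: `consumer`'s second stage applied to the given initial documents
        return dict(initial_docs)

    def fuse(runs):
        # stub standing in for CombMNZ fusion
        fused = {}
        for r in runs:
            for doc, s in r.items():
                fused[doc] = fused.get(doc, 0.0) + s
        return fused

    # 1. Initial result fusion: each consumer runs feedback on the fusion of
    #    all systems' initial document lists.
    fused_initial = fuse([initial_run(p) for p in systems])
    initial_result_fusion = {c: feedback_on(c, fused_initial) for c in systems}

    # 2. Fused common consumer: fuse one consumer's runs over all producers.
    common_consumer = {c: fuse([feedback_on(c, initial_run(p)) for p in systems])
                       for c in systems}

    # 3. Fused common producer: fuse all consumers' runs on one producer's docs.
    common_producer = {p: fuse([feedback_on(c, initial_run(p)) for c in systems])
                       for p in systems}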

Figure 6 summarizes the results of the swapfuse experiment. The first two bars in each group represent a system’s baseline performance from Table 1, with and without blind feedback. The fusion results from that table are also reproduced. For each system, the last three bars in the group represent our three different methods of fusing it with the other systems.

Fig. 6 Fusion results

The group titled “fusion” gives “fusion of fusion” results. Each column in this group represents the fusion of the corresponding columns from the eight system groups to its left. Finally, the lone column labelled “Fusing 64 Combinations” gives the fused result of 64 runs representing all combinations of producer and consumer components.

Of all the results, “Fused Common Consumer” gives the best MAP of 0.2654, 13% better (p = 10⁻⁶) than the best BF system, and 4% better (p = 10⁻⁵) than the simple fusion of the 8 BF runs. “Fused Common Consumer” is marginally better than “Fusing 64 Combinations” (p = 0.04) and “Fused Common Producer” (p = 0.04).

4.1 Fusing subsets

We investigated the possibility that one or more of the systems might have a deleterious effect on the fused result, and that fusion would be improved by removing these systems. To this end, we created all 255 non-empty subsets of the 8 BF runs and computed the MAP for the fusion of the runs in each subset.

Figure 7 shows several views of these results as a function of the size of the subsets. The curve “Max” shows the best result that could have been achieved had we an oracle to identify the particular subset that would achieve the best score. This curve provides an upper bound on what we might have achieved. At n = 6, this value exceeds that for n = 8 (the fusion of all systems) by only a small amount. The complementary curve “Min” shows the worst result that could have been achieved.

Fig. 7 Fused systems

“Best” presents the result of using the best n systems, as ranked by their individual MAP results. At least for this set of systems, little is gained from excluding the worst-performing system. “Worst” presents the complementary result. From this curve we may conclude that excluding one or two of the best systems has little impact on the effectiveness of fusion.

“Avg” presents the average MAP over all subsets. From this curve we may conclude that picking a random subset of the systems produces intermediate results for small subsets, but results comparable to “Worst” for larger subsets.
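
The subset analysis can be reproduced structurally as in the sketch below. Here fuse_and_score() is a placeholder standing in for “fuse this subset of BF runs with CombMNZ and compute its MAP”; it returns an arbitrary value only so the enumeration runs.

    # Enumerate all 255 non-empty subsets and derive Max/Min/Avg by subset size.
    from itertools import combinations
    from statistics import mean

    systems = ["CLARIT", "CLJ", "HITIQA", "UMass", "CMU",
               "MultiText", "Okapi", "SMART"]

    def fuse_and_score(subset):
        # placeholder: real code would fuse these runs with CombMNZ and run MAP
        return 0.20 + 0.001 * len("".join(subset))

    for n in range(1, len(systems) + 1):
        scores = [fuse_and_score(s) for s in combinations(systems, n)]
        print(n, max(scores), min(scores), mean(scores))   # Max, Min, Avg curves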

5 Concluding discussion

In an effort to better understand the blind feedback process, researchers participating in the RIA workshop compared information retrieval systems on a component-by-component basis, swapping and combining information after each retrieval stage. By decomposing the original systems into their individual components and examining each component separately, we discovered several areas where additional performance improvements might be obtained.

From the document swapping experiments (Sect. 2), we were surprised to discover that one system can outperform another as a producer of feedback documents, even when there is no significant difference in the original systems. For example, there is no significant difference between the baseline performance of SMART, Lemur (CMU) and MultiText, with or without blind feedback (Table 1). Nonetheless, the performance of both Lemur (CMU) and MultiText improves significantly when documents from SMART’s first retrieval stage are used for blind feedback. Indeed, the best overall performance in the document swapping experiment was obtained by Lemur (CMU) when consuming documents from SMART.

Overall, this experiment reveals that initial document selection greatly affects performance. One system consuming feedback documents produced by another system can give better performance than either system achieves alone. In general, a system that produces initial documents that increase performance for one consuming system tends to increase performance for all consuming systems. Conversely, a system that produces initial documents that decrease performance for one consuming system tends to decrease performance for all consuming systems.

First-stage retrieval (for blind feedback) and second-stage retrieval (for presentation to the user) should be treated as different problems. The characteristics that make a document good for feedback are not necessarily those that make it more likely to be relevant. Some systems, such as CLARIT/CLJ and MultiText, already reflect this observation by using a passage retrieval technique in the first stage. In addition, a document may discuss one facet of a request in detail but ignore another, making it a rich source of terms related to the first facet while not itself being relevant. Perhaps this property provides a partial explanation for the performance of the documents produced by the SMART system—SMART’s initial retrieval used adjacent word pairs that may have included multiple facets. However, we could find no confirming evidence for this hypothesis.

Term swapping (Sect. 3) did not generate the same level of improvement as document swapping. In most cases, swapping terms hurts system performance, though there were cases where it greatly increased performance (Table 8). However, we note that the average overlap among the top terms is fairly low, between 2 and 32% (Table 7). In future work we will exploit this observation by exploring ways of combining term lists generated by different feedback techniques. A term ranked highly by many techniques may be a better expansion term than one ranked highly by a single technique.

The best overall performance was achieved by swapping documents and fusing the results (Sect. 4). Despite the simplicity of our fusion technique, this result was 13% better than that achieved by any system on its own. Like the term swapping experiment, the results of the fusion experiment suggest that a systematic approach to combining intermediate results may produce substantial benefits.