Iterative exploration, design and evaluation of support for query reformulation in interactive information retrieval

https://doi.org/10.1016/S0306-4573(00)00055-8

Abstract

We report on the progressive investigation of techniques for supporting interactive query reformulation in the TREC Interactive Track. Two major issues were explored over four successive years: methods of term suggestion; and, interface design to support different system functionalities. Each year's results led to the following year's investigation, with respect to both of these issues. This paper presents first the general motivation for the entire series of studies; then an overview of each year's investigation, its results, and how they influenced the next year's investigation. We discuss what has been learned through this series of investigations about effective term suggestion, usable and useful interface design, and the relationships between these two in support of the TREC Interactive Track task. We conclude with comments about the general methodology employed over this series of studies, and its relevance to the development and evaluation of interactive information retrieval systems.

Introduction

Query formulation, and especially query reformulation, are understood to be among the most difficult tasks that users in interactive information retrieval (IR) systems face. A variety of techniques have been proposed for addressing this general problem, throughout the entire history of IR research (cf. Efthimiadis, 1996). However, very few of them have actually been tested in interactive IR environments. The research described here represents a principled attempt at taking one well-known technique for supporting query reformulation, relevance feedback (RF), and investigating its effectiveness and usability by implementing and evaluating it in the context of a specific interactive IR task.

Interface design for IR systems, although long understood to be an important problem for IR in general (cf. Walker, 1971), has only fairly recently become an active area of research in IR (e.g., Fox et al., 1993, Hearst and Karadi, 1997, Swan and Allan, 1998, Williamson and Shneiderman, 1992). Most IR research has focused upon system functionalities, ignoring interface questions, leading to a situation in which interfaces are often seen as things that are put on top of IR systems, rather than being integral parts of them (cf. the excellent review by Marchionini & Komlodi, 1999). Here, we present the results of a series of studies of interactive IR systems which attempted to address this problem by integrating interface design with development of the RF and other functionalities related to query reformulation. We construe this series of studies as an example of the iterative evaluation/design cycle, as in Egan et al. (1989), but perhaps extending that model in our emphasis on integration of interface with function.

We describe the progressive development of an IR system and its interface designed to support a particular IR task, and to support specific user problems with that task, and with IR systems in general. This was embodied in a series of studies carried out within the Text REtrieval Conferences (TREC) Interactive Track. The starting point for this series of studies was the idea of testing the usability, use and effectiveness of automatic relevance feedback (RF) in interactive IR, embodied in work reported in Belkin, Cool, Koenemann, Ng, and Park (1996). Some results from this investigation, and related work (Koenemann, 1996), suggested that although system-controlled RF using only positive relevance judgments was usable and useful, additional functionality might be desirable. Specifically, user-controlled RF (RF as a term suggestion device) and RF using both negative and positive judgments were features that users seemed to desire. These results led to a sequence of four studies, implementing and/or studying these new functionalities in various interface structures, using a basic common underlying IR system (Belkin et al., 1997, Belkin et al., 1998, Belkin et al., 1999, Belkin et al., 2000). The reports of these individual studies discussed issues specific to each one, and generally did not focus on explicit interface issues. In this paper, we present an integrated discussion of the entire series, with equal emphasis on functionality and interface design. For detailed discussion of each study, see the appropriate TREC publication.

We began this series of studies in the TREC-5 Interactive Track (Belkin et al., 1997), by investigating the use, usability and effectiveness of a system which implemented RF in the standard way in which it has been suggested for interactive IR; that is, by automatically adding terms to a query, based on the documents which had been judged relevant by the user. This study, like all subsequent ones in the series, was based on the instance recall task set by the TREC Interactive Track (Over, 2001). This task requires subjects to identify the different aspects or instances of a topic, and to save documents which represent those instances. The results of this study (and of a previous study in TREC-4, Belkin et al., 1996), led to some quite explicit changes in both the RF functionality, and the system and interface in which it was implemented, for our next study, in the TREC-6 interactive track.

Three major changes were made with respect to RF for our TREC-6 investigation (Belkin et al., 1998). One was the implementation of RF as a term suggestion device, rather than an automatic query expansion device. This followed from both the results of our TREC-5 study, and Koenemann's (1996) results indicating that RF as term suggestion is preferred to, and works at least as well as, automatic RF. Another was allowing both positive and negative relevance judgments to be made, leading to the suggestion of both “good” terms to add to the query with positive weights, and “bad” terms to add to the query with negative weights. The third change, related to the second, was to implement what we have elsewhere termed a “revisionist” version of RF (Cool et al., 1996, Belkin et al., 1997). Since that model is rather different from the “standard” version of RF, and since it was used in our studies in TRECs 6, 7 and 8, we discuss it in some detail in Section 1.2. The results of our TREC-6 study led us to make changes in the way in which RF was presented to the system user, and in a variety of interface-related issues, whose effects were investigated in our TREC-7 study.

In TREC-7 (Belkin et al., 1999), the underlying RF implementation was as in our TREC-6 system, and the main goal was still to investigate the utility of negative RF as implemented in our revisionist model. However, the conceptual model that was presented to the user became one of “term suggestion” rather than automatic, system-controlled RF, and the interface was redesigned to take account of these changes, to respond to some specific problems indicated by the subjects in TREC-6, and to better respond to some general HCI design principles. The results of the TREC-7 study were by-and-large positive with respect to usability issues, but effectiveness and perceived usefulness of term suggestion for the task were not what might have been wished. These results led us to investigate a different mode of term suggestion in our TREC-8 study, as well as to make the system design more directly related to the task itself.

In TREC-8 (Belkin et al., 2000), instead of comparing different versions of RF for term suggestion, we compared RF-based term suggestion to another mode which we hypothesized would be better suited to the instance recall task. In addition, we continued to change the conceptual model of the system that was presented to the user, and to change various interface characteristics to respond both to the task, and to difficulties experienced by the users in TREC-7. This study concluded the series of investigations reported here. Although there are still a few open questions, the results of our TREC-8 study lead us to believe that we have arrived at what seems to be a reasonable and effective way to support query reformulation for the instance recall task in terms of both functionality and interface design.

Automatic RF is well known to be an effective tool for query reformulation in situations in which large numbers of relevance judgments are available, for instance, in information routing or information filtering. Its efficacy has been demonstrated in such situations in a wide variety of studies, almost all of them being done in what can be roughly characterized as batch-mode, non-interactive, experimental IR test-collection situations (cf. Salton and Buckley, 1990, Spink and Losee, 1996). However, there have been rather few investigations of RF in interactive IR environments in which there are small numbers of relevance judgments (e.g., the TREC studies reported on in this paper; Koenemann, 1996, Robertson et al., 1999, Yang et al., 1999). The results of most of these studies have been either inconclusive or negative, at least when evaluated by traditional IR measures, with the notable exception of Koenemann (1996).

The basic requirement for RF to be implemented in batch-mode environments is the existence of a large set of documents on which exhaustive judgments of their relevance, or not, to a specific query have been made. RF of this sort, which we here call the “classic” model of RF, is based on ideas first proposed by Rocchio (1971), and can be succinctly characterized as follows:

  • Based on the concept of reaching an “ideal query”.

  • The ideal query is the best discriminator between relevant and non-relevant documents.

  • Query terms should be “optimal” discriminators.

  • Terms which appear only in the original query, or only in positively judged texts are good.

  • Terms which appear in both positively and negatively judged texts are bad because they are poor discriminators.

  • Terms which appear only in negatively judged texts are ignored, since they offer no information about discrimination value.

Under this model, the query-term weights of terms which appear in the query and positively judged documents are progressively increased through iterations of RF, while the weights of terms which appear in both positively and negatively judged documents are progressively reduced until they reach zero weight, when they are typically removed from the query (some unpublished experiments in non-interactive environments have shown that using negative weights decreases performance). Query expansion, the most effective aspect of RF (Harman, 1992), is achieved by adding to the query, with positive weights, terms which are important in the set of positively judged texts.
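As a rough illustration of the classic model summarized above, the following Python sketch applies one Rocchio-style feedback iteration to queries and documents represented as term-weight dictionaries. The function names (`centroid`, `classic_rf`), the dictionary representation, and the values of `alpha`, `beta` and `gamma` are assumptions made for illustration only; they are not taken from the systems used in the studies reported here.

```python
def centroid(docs):
    """Average the term weights over a list of documents (term -> weight dicts)."""
    total = {}
    for doc in docs:
        for term, weight in doc.items():
            total[term] = total.get(term, 0.0) + weight
    # If docs is empty, total is empty and no division takes place.
    return {term: weight / len(docs) for term, weight in total.items()}


def classic_rf(query, positive_docs, negative_docs,
               alpha=1.0, beta=0.75, gamma=0.25):
    """One iteration of classic (Rocchio-style) relevance feedback.

    alpha, beta and gamma are illustrative values (an assumption).
    """
    pos = centroid(positive_docs)
    neg = centroid(negative_docs)
    new_query = {}
    # Candidate terms come from the query and the positively judged texts;
    # terms that occur only in negatively judged texts are ignored.
    for term in set(query) | set(pos):
        weight = (alpha * query.get(term, 0.0)
                  + beta * pos.get(term, 0.0)
                  - gamma * neg.get(term, 0.0))
        # Terms whose weight falls to zero or below are removed,
        # so the query never carries negative weights.
        if weight > 0:
            new_query[term] = weight
    return new_query
```

In this sketch, a term that occurs in both positively and negatively judged texts loses weight on each iteration and is eventually dropped, while a term that occurs only in negatively judged texts never enters the query at all.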

In contrast to the classic model, we suggest a new, revisionist model of RF, which attempts to take into account the results of our studies of people's information seeking behaviors in interactive IR environments, and which can be characterized as follows:

  • The distinction between relevant and non-relevant texts which have the same terms is:

    • the terms are used in different contexts, or

    • the topics are treated peripherally, or

    • the topics are treated from an inappropriate point of view, or

    • the terms are polysemous.

  • RF should distinguish between appropriate and inappropriate treatments of topics.

  • Terms which appear in the query, or in positively judged texts, whether or not they also appear in negative texts, are good.

  • Terms which appear only in negatively judged texts are bad because they are indicators of inappropriate context, peripheral or inappropriate treatment of the topic, etc.

  • Bad terms should be used for query expansion with negative weights, and good terms with positive weights.

In this model, important terms in the negatively judged documents, which do not appear in positively judged documents, are understood as indicators of the inappropriate context, or the main topic, or the inappropriate point of view. This model thus leads us to a quite different way to implement RF. Query terms which appear in positively judged documents (irrespective of their appearance in negatively judged documents) have their query-term weights increased; and, the query is expanded by both the important terms in the positively judged documents (with positive weights) and by the important terms in the negatively judged documents which do not appear in the query or the positively judged documents (with negative weights). Details of how this model of RF was implemented in our studies are presented in Section 4.2.
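The contrast with the classic model can be made concrete with a small Python sketch of the revisionist update described above. It is only a sketch under stated assumptions: the function names (`centroid`, `revisionist_rf`), the use of the average term weight as a stand-in for term "importance", and the parameters `beta`, `gamma` and `top_k` are all hypothetical, and the code does not reproduce the implementation used in our studies.

```python
def centroid(docs):
    """Average the term weights over a list of documents (term -> weight dicts)."""
    total = {}
    for doc in docs:
        for term, weight in doc.items():
            total[term] = total.get(term, 0.0) + weight
    # If docs is empty, total is empty and no division takes place.
    return {term: weight / len(docs) for term, weight in total.items()}


def revisionist_rf(query, positive_docs, negative_docs,
                   beta=0.75, gamma=0.25, top_k=5):
    """One iteration of the revisionist model (illustrative parameters)."""
    pos = centroid(positive_docs)
    neg = centroid(negative_docs)
    new_query = dict(query)

    # "Good" terms: the top_k terms of the positively judged texts.
    # Query terms among them are up-weighted and new terms are added,
    # whether or not they also occur in negatively judged texts.
    for term in sorted(pos, key=pos.get, reverse=True)[:top_k]:
        new_query[term] = new_query.get(term, 0.0) + beta * pos[term]

    # "Bad" terms: terms found only in negatively judged texts,
    # i.e. in neither the query nor the positively judged texts.
    bad = [t for t in neg if t not in query and t not in pos]
    for term in sorted(bad, key=neg.get, reverse=True)[:top_k]:
        new_query[term] = -gamma * neg[term]

    return new_query
```

Unlike the classic sketch, a query term that occurs in both positively and negatively judged texts is not penalized here, and terms found only in negatively judged texts enter the query with negative weights.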

The next section describes the general methodology that we used in all four of the studies in this series, and defines the basic measures that were used to evaluate each system. Then, we give an overview of each of the four studies, in Sections 3 (TREC-5), 4 (TREC-6), 5 (TREC-7) and 6 (TREC-8), describing the study goals, the systems that were used, the experimental subjects, and the results, concluding with how the results affected what we investigated in the next study. We then discuss the results of the entire series of studies, and conclude with some remarks on the implications of this work for the design and evaluation of interactive IR systems.

General methodology

All of the studies reported in Sections 3 (TREC-5), 4 (TREC-6), 5 (TREC-7) and 6 (TREC-8) were conducted under the rules of the TREC Interactive Track for the relevant years (see Over, 2001 for the complete rules and research designs for TRECs 5–8, and Table 1 for a summary overview of our studies in each year). The specifics of each of the studies are described in some detail in each of those sections, but there are some features common to all of them, which we summarize here.

All our studies followed a similar

Study goals

In TREC-5, our approach was to further develop the conceptual work in the area of RF, reported by our group in previous TREC-3 and TREC-4 experiments in the ad hoc task (Koenemann et al., 1995, Belkin et al., 1996). In TREC-5, this theoretical interest in RF led to an investigation that explicitly conformed to the Interactive Track task. Our work in TREC-5 formed the baseline for the iterative development of the system functionalities and interface modifications reported throughout this paper.

Study goals

The primary goals of this study were:

  • to investigate the effectiveness and usability of negative RF in interactive IR;

  • to investigate the use and usability of RF as a term suggestion device;

  • to investigate aspects of the revisionist model of RF.

We attempted to accomplish these goals by implementing a version of RF which suggested both positive and negative terms for addition to the query, rather than automatically expanding the query with those terms.

System

We used InQuery 3.1p1 as the basis for our

Study goals

As in TREC-6, a main goal of this study was to investigate the effectiveness and usability of negative RF. A second goal was to investigate both positive and negative RF as user-controlled term suggestion. We also wanted to investigate the effects of the changes that were motivated by our TREC-6 results. These goals were accomplished by comparing two systems using the same interface, one offering positive and negative RF (INQ-R, Fig. 3), and the other offering only the positive RF feature (INQ-G).

Study goals

The primary goal of the TREC-8 study was to investigate the effectiveness and usability of two different term suggestion methods for interactive IR. This focus was a result of the observations in TREC-7 that the term suggestion feature was not used as often as we had expected, and that the subjects did not find it all that useful with respect to the task. We were also concerned to further investigate the issue of user control of term suggestion. The two methods that we compared were user

Discussion

The series of studies described in this paper began with an investigation of the use of a relatively standard version of automatic RF for support of query reformulation in the TREC Interactive Track task of instance recall, in a rather standard IR interface. Based on our analyses of the results of each study, modifications in both the functionality and interface of each system were made, and new issues were investigated in each successive study. The final study in the series investigated

Conclusions

We believe that more can be concluded from this series of studies, in addition to the specific results with respect to the problem of support for query reformulation in the instance recall task. In particular, it seems that having conducted a series of related studies, using the same or similar methods and measures in each one, allowed us to develop a meaningful sequence of principled changes and issues to be investigated. Being able to do this within the TREC context, especially within the

Acknowledgements

We dedicate this paper to the memory of Russell Swan, whose research and comments so often inspired us. We wish to thank all of the subjects who participated in our TREC studies, all donating over three hours of their time to helping us in our research. We also wish to thank all of the other people who participated with us as researchers in our TREC studies; their names are indicated in the author lists in the citations to our TREC publications. We owe a special debt of gratitude to our

References (28)

  • Belkin, N. J., Cool, C., Koenemann, J., Ng, K. B., & Park, S. Y. (1996). Using relevance feedback and ranking in...
  • Belkin, N. J., Cabezas, A., Cool, C., Kim, K., Ng, K. B., Park, S. Y., Pressman, R., Rieh, S. Y., Savage, P., & Xie, H....
  • Belkin, N. J., Perez Carballo, J., Cool, C., Lin, S., Park, S. Y., Rieh, S. Y., Savage, P., Sikora, C., & Xie, H....
  • Belkin, N. J., Perez Carballo, J., Cool, C., Kelly, D., Lin, S., Park, S. Y., Rieh, S. Y., Savage-Knepshield, P., &...
  • Belkin, N. J., Cool, C., Head, J., Jeng, J., Kelly, D., Lin, S. J., Lobash, L., Park, S. Y., Savage-Knepshield, P., &...
  • Borlund, P., et al. (1997). The development of a method for the evaluation of interactive information retrieval systems. Journal of Documentation.
  • Callan, J. P., Croft, W. B., & Harding, S. M. (1992). The INQUERY retrieval system. In Dexa 3, Proceedings of the third...
  • Cool, C., Belkin, N. J., & Koenemann, J. (1996). On the potential utility of negative relevance feedback in interactive...
  • Efthimiadis, E. N. (1996). Query expansion. In M. E. Williams (Ed.), Annual review of information science and...
  • Egan, D. E., et al. (1989). Formative design-evaluation of SuperBook. ACM Transactions on Information Systems.
  • Fox, E. A., et al. (1993). Users, user interfaces and objects: envision a digital library. Journal of the American Society for Information Science.
  • Harman, D. K. (1992). Relevance feedback revisited. In N. J. Belkin, P. Ingwersen, & A. Mark Pejtersen (Eds.), SIGIR...
  • Hearst, M. A., & Karadi, C. (1997). Cat-a-Cone: an interactive interface for specifying searches and viewing retrieval...
  • Koenemann, J. (1996). Relevance feedback: usage, usability, utility. Ph.D. Dissertation, Department of Psychology,...

    Some of the research reported here was supported by the DARPA TIPSTER Phase 3 Program, under contract number MDA904-96-C-1297, and by Graduate Associateships for Kelly and Lin from the Rutgers Distributed Laboratory for Digital Libraries.
