A retrospective study of a hybrid document-context based retrieval model

https://doi.org/10.1016/j.ipm.2006.10.009

Abstract

This paper describes our novel retrieval model, which is based on the contexts of query terms in documents (i.e., document contexts). Our model is novel because it explicitly takes document contexts into account, instead of using them only implicitly to find query expansion terms. The model is based on simulating a user making relevance decisions, and it is a hybrid of various existing effective models and techniques. It estimates the relevance decision preference of a document context as a log-odds and uses smoothing techniques, as found in language models, to solve the problem of zero probabilities. It combines these estimated preferences of document contexts using different types of aggregation operators that comply with different relevance decision principles (e.g., the aggregate relevance principle). Our model is evaluated using retrospective experiments (i.e., with full relevance information), because such experiments can (a) reveal the potential of our model, (b) isolate the problems of the model from those of parameter estimation, (c) provide information about the major factors affecting the retrieval effectiveness of the model, and (d) show whether the model obeys the probability ranking principle. Our model is promising, as its mean average precision is 60–80% in our experiments using different TREC ad hoc English collections and the NTCIR-5 ad hoc Chinese collection. Our experiments showed that (a) the operators consistent with the aggregate relevance principle were effective in combining the estimated preferences, and (b) estimating probabilities using the contexts in the relevant documents can produce better retrieval effectiveness than using the entire relevant documents.

Introduction

Various retrieval models have been developed and investigated over the past several decades based on a variety of mathematical frameworks (Dominich, 2000). For example, Salton, Wong, and Yang (1975) and Wong, Ziarko, and Wong (1985) worked on retrieval models based on vector spaces. The Binary Independence Model (BIM) (Robertson & Sparck-Jones, 1976), the logistic regression model (Cooper, Gey, & Dabney, 1992), the 2-Poisson model (Harter, 1975) and its later practical approximation (Robertson & Walker, 1994), and the language modelling approach (Lafferty & Zhai, 2001; Lavrenko & Croft, 2001; Ponte & Croft, 1998) are based on probability theory. The fuzzy retrieval model (e.g., Miyamoto, 1990) and the extended Boolean model (Salton, Fox, & Wu, 1983) are based on fuzzy set theory (Zadeh, 1965). These models provide a system point of view of how to retrieve documents that are sufficiently relevant to satisfy a user’s information need. On the other hand, an information retrieval system can be thought of as simulating the human user making relevance decisions in the retrieval process (Bollmann & Wong, 1987). In this case, the ranking of documents’ relevance to the user’s information need is expressed in terms of preferences (Yao & Wong, 1991).

In this work, we simulate human relevance decision making in the development of a novel retrieval model that explicitly models a human relevance decision at each location in a document. The relevance decision at a specified location in the document is based on the context at that location, so the relevance decision preference (or context score) at that location is estimated using its context. Although using contexts in documents to explore term co-occurrence relationships for query expansion is not new, to the best of our knowledge it is new to model context/window features explicitly in the retrieval model by incorporating the locations of terms inside a document to re-weight the query terms. By re-weighting the query terms using their contexts in documents, the model assigns context-dependent term weights, which are aggregated into the final document similarity score.

A document context is essentially a concordance or a keyword-in-context (KWIC) entry (Kupiec, Pedersen, & Chen, 1995). Fig. 1 shows some example document contexts containing a query term from the title query, “Hubble Telescope Achievements”. The contexts were extracted from a raw (un-processed) document. During retrieval, unlike in Fig. 1, all terms are stemmed and stop words are removed. Fig. 1 also illustrates that, even in a relevant document, not all contexts are relevant.
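To make the notion concrete, the following sketch extracts KWIC-style document contexts after the kind of preprocessing just described. It uses NLTK’s Porter stemmer; the window half-width of five tokens, the tiny stop-word list, and the function names are our illustrative assumptions, not details taken from the paper.

```python
import re
from nltk.stem.porter import PorterStemmer  # Porter stemming, as described above

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "has"}  # illustrative subset only
stemmer = PorterStemmer()

def preprocess(text):
    """Lower-case, tokenize, drop stop words, and Porter-stem the rest."""
    return [stemmer.stem(t) for t in re.findall(r"\w+", text.lower())
            if t not in STOP_WORDS]

def extract_contexts(document, query, half_window=5):
    """Return one KWIC-style context per query-term occurrence. Each
    context is identified by the location i where the query term
    occurs and spans up to half_window tokens on either side."""
    tokens = preprocess(document)
    query_terms = set(preprocess(query))
    contexts = []
    for i, tok in enumerate(tokens):
        if tok in query_terms:
            window = tokens[max(0, i - half_window): i + half_window + 1]
            contexts.append((i, tok, window))
    return contexts

# Example with the title query of Fig. 1; note that stemming lets
# "achieved" in the document match "Achievements" in the query.
doc = "The Hubble telescope has achieved striking images of distant galaxies."
for loc, term, window in extract_contexts(doc, "Hubble Telescope Achievements"):
    print(loc, term, window)
```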

Our novel model uses existing successful retrieval models and techniques to estimate the relevance decision preferences (or context scores) of document contexts containing a query term at their centre. The relevance decision preferences are defined as log-odds estimated using smoothing techniques, and they are combined using aggregation operators. More specifically, we used smoothing (Chen & Goodman, 1996; Zhai & Lafferty, 2004) to solve the problem of zero probabilities (Ponte & Croft, 1998) when estimating the term distributions in relevant documents, similar to the language models (Lafferty & Zhai, 2001; Lavrenko & Croft, 2001; Ponte & Croft, 1998). We calculated the probability of relevance of a particular document context in a manner similar to the BIM (Robertson & Sparck-Jones, 1976). To calculate the document score for ranking, the document-context log-odds are combined using different evidence aggregation operators based on the extended Boolean model (Salton et al., 1983) and some fuzzy (aggregation) operators (Dombi, 1982; Paice, 1984; Yager, 1988). Our proposed retrieval model is therefore a hybrid of various past successful retrieval models and techniques.
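The sketch below illustrates the flavour of this hybrid, assuming Jelinek–Mercer-style interpolation for smoothing and a p-norm (extended-Boolean-style) operator for aggregation. The parameter values, the clamping of negative log-odds, and all function names are our illustrative choices, not the paper’s exact formulation (which is given in Section 2).

```python
import math

def context_log_odds(context, p_rel, p_irr, p_bg, lam=0.8):
    """Relevance decision preference of one document context, as a
    log-odds. p_rel and p_irr map terms to probabilities estimated
    from the relevant and irrelevant documents; p_bg is a background
    (collection) model used to smooth away zero probabilities."""
    score = 0.0
    for term in context:
        p_r = lam * p_rel.get(term, 0.0) + (1.0 - lam) * p_bg[term]
        p_i = lam * p_irr.get(term, 0.0) + (1.0 - lam) * p_bg[term]
        score += math.log(p_r / p_i)
    return score

def p_norm_combine(scores, p=2.0):
    """p-norm (extended-Boolean-style) OR-like combination of context
    scores into one document score; negative log-odds are clamped to
    zero here purely to keep the p-norm well defined."""
    clamped = [max(s, 0.0) for s in scores]
    if not clamped:
        return 0.0
    return (sum(s ** p for s in clamped) / len(clamped)) ** (1.0 / p)

# Toy usage: combine the log-odds of two contexts of one document.
p_bg = {"hubbl": 0.005, "telescop": 0.004, "achiev": 0.002}
contexts = [["hubbl", "telescop"], ["achiev"]]
doc_score = p_norm_combine(
    [context_log_odds(c, {"hubbl": 0.03, "achiev": 0.01},
                      {"hubbl": 0.001}, p_bg) for c in contexts])
print(doc_score)
```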

In predictive experiments, a major source of difficulty in developing novel retrieval models is determining whether effectiveness is limited by the underlying model or by poor parameter estimation techniques. Instead of predictive experiments, we propose to evaluate our novel retrieval model with retrospective experiments performed using relevance information (e.g., the TREC relevance judgments), similar to the retrospective experiments in Robertson and Sparck-Jones (1976), Sparck-Jones et al. (2000), and Hiemstra and Robertson (2001). The purpose of the retrospective experiments is to

  • (a) evaluate the potential of the underlying novel retrieval model by observing the best effectiveness that the model can attain;

  • (b) reveal the (near) optimal performance of the model and provide a yardstick for future (predictive) experiments; under the probability ranking principle (Robertson, 1977), full relevance information can enable the model to attain optimal performance (Hiemstra & Robertson, 2001);

  • (c) identify the crucial factors (e.g., the size of the context) that affect the performance of the model when using the contexts of query terms in a document, and gather statistical data on these factors for analysing and designing the model to operate in predictive experiments;

  • (d) show whether the model obeys the probability ranking principle (Robertson, 1977), for which a simple retrospective check is sketched after this list; and

  • (e) examine the relevance decision principles in Kong, Luk, Lam, Ho, and Chung (2004) and determine which is the most suitable for simulating the human user making relevance decisions.
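As a minimal illustration of check (d): in the retrospective setting, one necessary condition for the optimal ranking that full relevance information should permit under the probability ranking principle is that every relevant document scores above every irrelevant one. The sketch below tests exactly that separation; it is our simplification for illustration, not the paper’s actual test.

```python
def perfectly_separated(scores, relevant):
    """True iff every relevant document outscores every irrelevant one,
    i.e., ranking by score puts all relevant documents first (average
    precision 1.0 for this query). scores maps doc id -> score and
    relevant is the set of relevant doc ids for the query."""
    rel = [s for d, s in scores.items() if d in relevant]
    irr = [s for d, s in scores.items() if d not in relevant]
    if not rel or not irr:
        return True
    return min(rel) > max(irr)

print(perfectly_separated({"d1": 2.8, "d2": 1.9, "d3": 0.4}, {"d1", "d2"}))  # True
```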

The problem of estimating parameters with limited or no relevance information is left for future work, since it is not yet known whether the proposed model is worth further investigation. When considering the terms in relevant documents, we discard terms whose document frequency equals one. This avoids finding identifiers (e.g., document ids) that uniquely identify relevant documents, which would otherwise guarantee high precision whenever relevance information is present. Instead, we want to utilize the term distributions in relevant and irrelevant documents for retrieval.
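A one-function sketch of this filtering step, assuming a precomputed mapping from terms to document frequencies (the function and variable names are ours):

```python
def non_identifier_terms(doc_freq):
    """Discard terms whose document frequency equals one, so that
    near-unique strings (e.g., a document id) cannot serve as
    giveaway identifiers of relevant documents. doc_freq is assumed
    to map each term to its document frequency in the collection."""
    return {term for term, df in doc_freq.items() if df > 1}

print(non_identifier_terms({"hubbl": 1200, "ap890512-0042": 1}))  # {'hubbl'}
```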

We emphasize that our document-context based model is a descriptive model in this paper, even though it could become a normative model. A descriptive model describes how the decision is made, while a normative model specifies how the (optimal) decision should be made. Our document-context based model is descriptive here because it does not feed back any effectiveness information (e.g., MAP) to the system for performance optimization (e.g., query optimization (Buckley & Harman, 2003) or model parameter optimization). For instance, our retrieval model directly estimates the probabilities without any feedback on whether the estimation is good for document ranking. Also, the retrieval process of our model is a one-pass re-ranking process using the proposed ranking formula (discussed in detail in Section 2), which describes how the relevance decision is made.
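To make the one-pass character concrete, here is a sketch of such a re-ranking loop, using the maximum context score as the simplest OR-like aggregation; the callables and the choice of max are our assumptions for illustration, not the paper’s ranking formula.

```python
def one_pass_rerank(doc_ids, contexts_of, context_score):
    """One-pass re-ranking: each document is scored exactly once from
    its document contexts and the list is sorted by that score. No
    effectiveness measure (e.g., MAP) feeds back into the scoring.
    contexts_of(doc) yields the document's contexts; context_score(c)
    returns a context's relevance decision preference."""
    def doc_score(doc):
        scores = [context_score(c) for c in contexts_of(doc)]
        return max(scores) if scores else float("-inf")
    return sorted(doc_ids, key=doc_score, reverse=True)
```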

One may argue that if we know the relevance information, then the retrieval effectiveness must be good and it is pointless to do the experiments. However, as mentioned above, we are not finding identifiers of relevant documents (terms with document frequency equal to one are ignored). The descriptive model does not optimize the query or the model parameters using effectiveness results from previous runs. Moreover, retrieval performance is not guaranteed to be good even when the relevance information is known; for example, in Hiemstra and Robertson (2001), the performance of the retrospective experiments is similar to that of the predictive experiments (Robertson & Walker, 1999). This is because terms in the relevant documents may also appear in the irrelevant documents. By using the relevance information, we are not manipulating or restricting the term distributions/occurrences in the documents, but using existing probabilistic methods to estimate them. Furthermore, we tested our model with different document collections (TREC-2, TREC-6, TREC-7, TREC-2005 and NTCIR-5) to show that the model is reliable. Finally, the retrospective experiments also provide an important clue about the potential of the retrieval model, because an applicable model should perform well in the presence of relevance information. The use of relevance information can reveal the (near) optimal performance, and the relevance information itself can be estimated using various techniques such as pseudo relevance feedback.

We will not examine the time-efficiency of our retrieval model or retrieval system because:

  • (a) it is already very challenging to design and develop a highly effective retrieval model;

  • (b) once an effective retrieval model is developed, we will have enough information to design and develop (novel) index structures to support it; and

  • (c) the time-efficiency problem may diminish in significance over time, as computers continually become more powerful.

We leave making our retrieval model more time-efficient to future investigation.

The rest of the paper is organized as follows. Section 2 presents the details of our document-context based retrieval model. Section 3 reports the results of the model-oriented experiments, which test the model extensively using one data collection, TREC-6. Section 4 reports the results of the scope-oriented experiments, which test the model across different data collections and with another language. Section 5 discusses related work. Finally, Section 6 concludes and describes future work.

Section snippets

Document-context based retrieval model

In this section, we introduce our document-context based retrieval model that ranks documents on the basis of the contexts of the query terms in documents (i.e., document contexts). A document context is uniquely identified by the location where the query term occurs in the document. Therefore, assigning different term weights to the same query term in different contexts can be thought of as assigning different term weights to the same query term in different locations in the document. Hence,

Model-oriented experiments

In this section, we present the results of the model-oriented experiments, which extensively investigate the factors affecting the effectiveness of the model using the TREC-6 ad hoc collection. This collection contains 556,077 English documents. We use the TREC-6 title (short) queries 301–350 in the experiments. Title queries are used because they have few (1–4) query terms, similar in length to web queries. All the terms in the documents and queries are stemmed using the Porter

Scope-oriented experiments

In this last set of experiments, we test the reliability of the proposed model by experimenting with it on different data collections (the ad hoc retrieval tasks of TREC-2, TREC-6 and TREC-7, and the robust-track retrieval of TREC-2005) and another language (the Chinese NTCIR-5 collection). Similar to the experiments in Section 3, title (short) queries in each of the collections are used, as they are commonly found in web search. The performance on TREC-6 was evaluated in the previous section (Section 3), and the TREC-7

Related work

Vechtomova and Robertson (2000) presented a method of combining corpus-derived data on word co-occurrences with the probabilistic model of information retrieval. Significant collocates were selected using a window-based technique around the node terms, with mutual information (MI) scores and Z statistics used to filter the significant associated collocates. The notion of context in our proposed model is very similar to the window-based technique used by Vechtomova and Robertson. However, our model

Summary and future work

In summary, we proposed a novel hybrid document-context retrieval model that uses existing successful techniques to explore the effectiveness of incorporating term locations inside a document into the retrieval model. We used the log-odds of the well-known BIM retrieval model (Robertson & Sparck-Jones, 1976) as the starting point for deriving our document-context based model. We extended the existing probabilistic model from the document level to the document-context level, in

Acknowledgements

This work is supported by CERG Project # PolyU 5183/03E. Robert thanks the Center for Intelligent Information Retrieval, University of Massachusetts (UMASS), for supporting him in developing part of the basic IR system while he was on leave at UMASS.

References (62)

  • C. Burgess et al. (1998). Explorations in context space: words, sentences, discourse. Discourse Processes.
  • Callan, J. P. (1994). Passage-level evidence in document retrieval. In Proceedings of ACM SIGIR 1994 (pp....
  • Chen, S. F., & Goodman, J. (1996). An empirical study of smoothing techniques for language modelling. In Proceedings of...
  • W.S. Cooper (1995). Some inconsistencies and misidentified modeling assumptions in probabilistic information retrieval. ACM Transactions on Information Systems (TOIS).
  • Cooper, W. S., Gey, F. C., & Dabney, D. P. (1992). Probabilistic retrieval model based on staged logistic regression....
  • S. Dominich (2000). A unified mathematical definition of classical information retrieval. Journal of the American Society for Information Science.
  • E. Fox et al. (1992). Extended Boolean models. Information Retrieval: Data Structures & Algorithms.
  • Gao, J., Nie, J. Y., Wu, G., & Cao, G. (2004). Dependence language model for information retrieval. In Proceedings of...
  • Harman, D. (2004). Private communication (at...
  • S.P. Harter (1975). A probabilistic approach to automatic keyword indexing (part 1). Journal of the American Society for Information Science.
  • Hiemstra, D., & Robertson, S. E. (2001). Relevance feedback for best match term weighting algorithms in information...
  • H. Jeffreys (1948). Theory of probability.
  • Jelinek, F., & Mercer, R. (1980). Interpolated estimation of Markov source parameters from sparse data. In Proceedings...
  • W.E. Johnson (1932). Probability: deductive and inductive problems. Mind.
  • Kaszkiel, M., & Zobel, J. (1997). Passage retrieval revisited. In Proceedings of ACM SIGIR 1997 (pp....
  • M. Kaszkiel et al. (1999). Efficient passage ranking for document databases. ACM Transactions on Information Systems (TOIS).
  • Kishida, K., Chen, K.-H., Lee, S., Kuriyama, K., Kando, N., Chen, H.-H., et al. (2005). Overview of CLIR task at the...
  • Kong, Y. K., Luk, R. W. P., Lam, W., Ho, K. S., & Chung, F. L. (2004). Passage-based retrieval based on parameterized...
  • Kupiec, J., Pedersen, J., & Chen, F. (1995). A trainable document summarizer. In Proceedings of ACM SIGIR 1995 (pp....
  • Lafferty, J., & Zhai, C. (2001). Document language models, query models and risk minimization for information...
  • J. Lafferty et al. Probabilistic relevance models based on document and query generation.